1.  Introduction


1.1  Background 

Originally, when we began this project, we intended to develop a decomposition
theory for implementing parallel algorithms on “standard” mesh-connected systolic
arrays. We believed we had good reasons for feeling that this problem was a highly
algebraic one, and that various algebraic ideas could be used to arrive at systematic
design procedures. We still agree with this point of view, but the emphasis of our work
has changed considerably, for reasons which we hope to make clear.

There were two problems which we hoped to solve by our algebraic approach. The
first is the problem of sorting N elements by pairwise exchanges, and the second, that of
tridiagonalizing a symmetric matrix by Givens rotations. (These problems are discussed
further in Chapters 2 and 3, respectively.) While we were able to find an optimal
solution to the first problem for a linear array, we were unable to do so for a rectangular
one, and we failed similarly for the tridiagonalization problem. From these experiences,
it became clear to us that there were extremely difficult “constrained parallel complexity
problems” which needed to be solved. That is, while there is a reasonably well-developed
theory of computational complexity for single processor computers and some
theory for unrestricted parallel computation, there is almost no theory of complexity for
computations subject to the communication constraints imposed by a processor network.

It is our hope that the concepts which we are now exploring will help us to understand
problems of parallel computation, especially those which arise in parallel architectures
with limited inter-processor communication. We would, of course, hope to develop
provably optimal architectures with such a theory, but this seems to be extremely
difficult for all but the simplest of problems, so for now our work focuses on developing
“better” architectures. First, we are searching for a reasonable framework in which to
formulate a version of such a constrained theory of complexity and, in particular, for a
class of communication networks which is general enough to contain optimal or
near-optimal networks for standard problems such as sorting, matrix multiplication, and
FFT computation, and for which the fabrication costs are acceptable. It is this which has
led us to propose Cayley networks (see Chapter 2) as a possible class, but we feel that it
is far too soon to have any confidence that this is a final choice. Second, we are
attempting to formulate mathematical questions which are equivalent to determining lower
bounds on the time complexity of the sorting and tridiagonalization problems for a given
class of arrays, and to solve these problems for mesh-connected arrays. Third, we are
attempting to use the insights gained in our theoretical work to develop architectures
to implement various algorithms in efficient ways. Finally, in the course of all this
work we are attempting to develop tools to aid ourselves and others in future work in
these areas; our major such tool, so far, is our high-level simulation language MHDL,
which is still in its earliest stages of development.

1.2  Chapter  Summaries 

Preliminary Concepts: Chapter 2 gives the definitions and notations to be used
throughout the report. In an attempt to be at least formally self-contained, we shall give
definitions of a number of elementary mathematical structures, such as groups and
graphs, but we shall provide relatively little intuition for many of the concepts we
define. We will attempt to give adequate references for all concepts which we use,
however.

Matrix Tridiagonalization: Chapter 3 concerns the problem of tridiagonalizing a
real symmetric matrix by using Givens rotations acting in parallel. We derive a lower
bound for a class of tridiagonalization (TD) algorithms.

Regular Graphs and Cayley Graphs: Chapter 4 contains some of the results we
obtained in our study of graphs. In particular, we describe the results of our heuristic
search methods for regular graphs, and introduce the concept of a Cayley graph.

MHDL: Finally, in Chapter 5 we describe the Modular Hardware Description
Language. This is a high-level simulation language which we have defined for testing
algorithms for our general class of Synchronized Modular Networks. While this language
is currently working “as advertised,” it is very much under development and subject to
drastic change without notice.

1.3  Remarks 

Some features of the organization of this report seem worth comment. Rather than
have an index of notation, as is more typical of mathematical work, we have included a
list of symbols under the heading “Notation” in the regular index. Also, instead of a
single bibliography, each chapter has a closing section containing references.¹
Consequently, any citation within a chapter is to one of the references listed at the end
of the chapter. Finally, it may seem to the reader that it is rather absurd to call the
major divisions of such a short document “Chapters”; rest assured that it does to us, as
well, but for reasons too boring to mention here it was expedient.


¹ This is due, principally, to having the various chapters written by different authors.


2.  Preliminary  Concepts 


2.1  Introduction 

This chapter introduces the principal mathematical concepts of this report. As this
document is intended to serve as a basic reference for our later work, concepts will
occasionally be introduced somewhat baldly, and in some cases receive no further
development here.

At the most abstract level, we are interested in developing a theory for a certain
class of automata, which we call synchronized modular networks, or SMNs. These
automata are defined in Section 2.2, but aside from some trivial results which we prove
there, no further mention of these objects will be made here. However, being
mathematicians, we need to know precisely what we are talking about, even when it is
irrelevant, so we present these definitions.

Section 2.3 defines various types of combinatorial graphs which were studied as
possible communication networks and parallel computation architectures. None of these
ideas is new with us and any originality in the presentation is the result of the
ubiquitous random processes which influence all our lives.

Section  2.4  discusses  briefly  the  concept  of  communications  compatibility  between 
architectures  and  algorithms,  and  illustrates  this  concept  with  some  examples  drawn 
from  parallel  sorting  problems. 

2.2  Synchronized  Modular  Networks

The  concepts  discussed  here  regarding  Synchronized  Modular  Networks  are  drawn 
from  the  theory  of  automata.  They  are  given  here  simply  to  provide  a  precise  basis  for 
the  rest  of  our  mathematical  work,  rather  than  as  a  starting  point  for  our  investigations, 
as  we  have  been  unable  to  prove  much  of  interest  within  this  very  general  framework. 

Definition 2.2.1: An abstract machine $M$ is a quintuple $(I, X, O, \delta, \beta)$ where

    $I = I_1 \times \cdots \times I_{n_M}$,

    $O = O_1 \times \cdots \times O_{m_M}$,

    $\delta : I \times X \to X$,

and

    $\beta : X \to O$.

$I$, $O$, and $X$ are referred to as the input, output, and state sets, respectively, of
$M$. $\delta$ is the next state or state transition function, and $\beta$ is the output
function. Intuitively, the machine cycles as follows:

    $M$ is in some state $x_1$, receives input $a_1$, moves to internal state
    $\delta(a_1, x_1) = x_2$, outputs $\beta(x_2)$, and waits in state $x_2$ for more input.

We say state $y$ is reachable from state $x$ if there exist $a_1, \ldots, a_n \in I$ and
$x_1, \ldots, x_n \in X$ with $x = x_1$, $x_{i+1} = \delta(a_i, x_i)$ for
$i = 1, \ldots, n-1$, and $y = \delta(a_n, x_n)$. We assume that all states are reachable
from one another, and we shall suppose that $X$ has a designated element, which we shall
call the initial state.
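The machine cycle and the reachability condition can be illustrated with a minimal Python sketch. The class name `AbstractMachine`, the method names, and the mod-3 counter example are all hypothetical choices made here for illustration; the sketch assumes finite input and state sets so that reachability can be checked by exhaustive search.

```python
from collections import deque


class AbstractMachine:
    """A machine (I, X, O, delta, beta) with finite input and state sets."""

    def __init__(self, inputs, states, delta, beta, initial):
        self.inputs = list(inputs)   # input set I
        self.states = list(states)   # state set X
        self.delta = delta           # next-state function: (input, state) -> state
        self.beta = beta             # output function: state -> output
        self.state = initial         # designated initial state

    def step(self, a):
        """Receive input a, move to the next state, and emit its output."""
        self.state = self.delta(a, self.state)
        return self.beta(self.state)

    def reachable_from(self, x):
        """Return the set of states reachable from x by one or more inputs."""
        seen, queue = set(), deque([x])
        while queue:
            current = queue.popleft()
            for a in self.inputs:
                nxt = self.delta(a, current)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


# Hypothetical example: a mod-3 counter whose output says whether the count is zero.
counter = AbstractMachine(
    inputs=[0, 1],
    states=[0, 1, 2],
    delta=lambda a, x: (x + a) % 3,
    beta=lambda x: x == 0,
    initial=0,
)
outputs = [counter.step(a) for a in [1, 1, 1, 0]]
all_reachable = all(counter.reachable_from(x) == {0, 1, 2} for x in counter.states)
```

In this example every state is reachable from every other, which is the standing assumption made above.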

Also, we associate with a machine $M$ the sets $l_I = \{ i_1, \ldots, i_{n_M} \}$ and
$l_O = \{ o_1, \ldots, o_{m_M} \}$, which we call the sets of input and output leads,
respectively. If we have some indexed family $M_\alpha$ of machines, the leads will be
denoted $l_I(\alpha)$, ..., $o_j(\alpha)$, or $l_{I,\alpha}$, ..., $o_{j,\alpha}$, etc.

For  our  discussion  we  assume  the  existence  of  some  set  P  of  primitive  machines,  or 
processing  elements  (PEs). 

Definition 2.2.2: Let $M$ be some set of abstract machines. Then
$N = (I, P, I_E, O_E, c)$ is a synchronized $M$-modular machine, or an SMN of $M$-type,
iff the following hold:

a.  $I$ is a finite index set.

b.  $P : I \to M$.

c.  $I_E$ and $O_E$ are input and output sets, respectively.²

d.  $c$ is a bijection

        $c : l_I(E) \cup \bigcup_\alpha l_O(\alpha) \to l_O(E) \cup \bigcup_\alpha l_I(\alpha)$

    such that the intersection of $c(l_I(E))$ with $l_O(E)$ is empty.

We denote by $c_\alpha$ the projection of $c$ into $l_I(\alpha)$, and by $c_O$ the
projection of $c$ into $l_O(E)$. $c$ has a natural extension from leads to the values
carried on them,

    $c : I_E \times \prod_\alpha O_\alpha \to O_E \times \prod_\alpha I_\alpha$.

We use this map without further comment.

² That is, as we said before, $I_E = I_1^E \times \cdots \times I_{n_E}^E$, etc. The leads
of $I_E$ and $O_E$ are denoted $l_I(E)$, $l_O(E)$, etc. We shall assume that $E$ is not
in $I$.
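Proposition 2.2.1: If $N$ is an SMN of $M$-type, then $N$ is an abstract machine.

Proof: We associate $N$ with the machine

    $(I_E,\ \prod_\alpha X_\alpha,\ O_E,\ \delta_N,\ \beta_N)$,

where³ $\beta_N(x) = c_O(\langle \beta_\alpha(x_\alpha) \rangle)$. Next, if $a_E$ and
$\langle x \rangle$ belong to $I_E$ and $X = \prod_\alpha X_\alpha$, respectively, and if
$\delta_\alpha$ is the state transition function for $M_\alpha$, then

    $\delta_N(a_E, \langle x \rangle)_\alpha = \delta_\alpha(c_\alpha(a_E, \langle \beta_\alpha(x_\alpha) \rangle),\ x_\alpha)$.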

Definition 2.2.3: If $P$ is a set of machines, then $P'$ denotes the set of SMNs of
$P$-type.

Proposition  2.2.2:  P"  =  P'. 

Proof:  It  follows  directly  from  the  definition  and  the  previous  proposition. 

Remarks: 

a.  We  generally  assume  that  we  inhabit  the  universe  of  SMNs  of  P-type  for  some 
fixed  P,  and  suppress  all  mention  of  P. 

b.  We refer to a conjunction of two SMNs, $M_1$ and $M_2$, as an SMN formed by
connecting some outputs of $M_1$ to inputs of $M_2$, and vice versa. A cascade of $M_1$
with $M_2$ is a conjunction where no outputs of $M_2$ are connected to $M_1$.

c.  It is obvious that every SMN with $n$ PEs can be formed by the conjunction of an
$(n-1)$-element SMN with a single PE.

2.3  Graphs  and  Networks 

We now define a number of standard concepts from the theory of combinatorial graphs,
and use them to formulate problems about networks in this and the next section. As these
concepts are very abstract, the mathematical questions about networks which can be
formulated in terms of them are much simpler than the actual details which would arise
in practical implementation of a network of synchronized processors. However, these
questions are already extremely difficult in many cases, and in some sense we feel that
their generality helps to ensure their usefulness. Thus, while intelligent readers will
doubtless see many ways in which these formulations are inadequate in helping to
understand real parallel processing networks, we hope that they will feel the solution of
problems posed in this report will be of real use in gaining such understanding.

We begin by defining a graph (also known as an undirected graph). A graph $\Gamma$ is a
pair $(V, E)$, where $V$ is a set of points known as vertices or nodes, and $E$, the edge
set, is a set of (unordered) pairs of nodes $\{v, w\}$, where $v, w$ belong to $V$. We
think of the edge as being a line segment connecting the points $v$ and $w$, and we say
that this edge is incident to $v$ (and $w$). This edge⁴ will usually be denoted $vw$.

³ Notice that we make use of the fact that the preimage of $c_O$ is a subset of
$\bigcup_\alpha l_O(\alpha)$.

The degree of a node, or vertex, $v$ in $\Gamma$ is the number of edges in $\Gamma$
incident to $v$. If all nodes in $\Gamma$ have the same degree, we say that $\Gamma$ is
regular.

A path of length $n$ from $x$ to $y$ is a sequence of vertices
$\langle v \rangle = \{v_0, \ldots, v_n\}$ in $\Gamma$, such that $v_0 = x$, $v_n = y$,
and for each $i$, $v_i v_{i+1}$ is an edge of $\Gamma$. Clearly, this is the same as a
sequence of wires connecting the processors corresponding to $x$ and $y$. (It might be
that $v_i$ and $v_{i+2}$ are the same, corresponding to a path which doubles back. We do
not consider the case where $v_i$ and $v_{i+1}$ are the same; that is, we don’t admit the
case of an edge connecting a vertex to itself.)

The distance between distinct nodes $x$ and $y$ is the minimum length of paths
between them. (The distance from $x$ to itself is taken to be zero.) We denote this
distance by $d(x, y)$. By the diameter of the graph, we mean

    $\Delta(\Gamma) = \max_{x, y \in \Gamma} d(x, y)$,

where the max is taken over all pairs of vertices in $\Gamma$.
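The definitions of distance and diameter translate directly into a breadth-first search. The following is a minimal sketch, assuming a connected graph stored as an adjacency dictionary; the helper names `distances_from`, `diameter`, and `mesh` are illustrative choices, not notation from this report.

```python
from collections import deque


def distances_from(adj, source):
    """Breadth-first search: length of a shortest path from source to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def diameter(adj):
    """Maximum distance over all pairs of nodes (the graph must be connected)."""
    return max(max(distances_from(adj, v).values()) for v in adj)


def mesh(rows, cols):
    """Adjacency lists of a rows x cols mesh-connected array."""
    adj = {(r, c): [] for r in range(rows) for c in range(cols)}
    for (r, c) in adj:
        for (dr, dc) in ((1, 0), (0, 1)):
            nb = (r + dr, c + dc)
            if nb in adj:
                adj[(r, c)].append(nb)
                adj[nb].append((r, c))
    return adj


grid = mesh(4, 4)
grid_diameter = diameter(grid)
```

For the 4 x 4 mesh the diameter is 6, the distance between opposite corners.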

We  now  want  to  define  a  restricted  type  of  graph  which  is  of  interest  partly  because 
it  possesses  a  high  degree  of  regularity.  First,  however,  we  give  the  definition  of  a  type 
of  algebraic  construction  known  as  a  group. 

Let $G$ be a set of points, or elements, and let $\times$ be an operation which takes
ordered pairs of elements $(a, b)$ to a third element, denoted $a \times b$ or simply
$ab$. We say that there is a distinguished element known as the identity element, denoted
by $e$, provided $e \times a = a \times e = a$ for all $a$ in $G$. We say that $a$ has an
inverse, denoted by $a^{-1}$, provided $a^{-1} \times a = a \times a^{-1} = e$. We say
that $\times$ is associative provided $a \times (b \times c) = (a \times b) \times c$ for
all $a$, $b$, and $c$ in $G$.

A group is a set $G$ and an operation $\times$ such that $\times$ is associative, $G$ has
an identity element, and every element of $G$ has an inverse.

Some of the simplest examples of groups are the cyclic⁵ groups, denoted $Z/n$, which
are the integers $\{0, \ldots, n-1\}$ with the operation being addition modulo $n$.

Definition 2.3.1: A Cayley graph is a graph constructed as follows: Let the point set
of the graph $G$ be some finite group, also denoted by $G$. Let $\Omega$ be some set of
elements of $G$ where any element in $G$ can be written as the product of elements of
$\Omega$ (that is, $\Omega$ generates $G$), and $\Omega^{-1} = \Omega$. Then every element
$g$ in $G$ is connected to all elements of the form $\omega g$, where $\omega$ is in
$\Omega$, and only those elements. We denote the Cayley graph associated with the pair
$(G, \Omega)$ by $\Gamma(G, \Omega)$. It is easy to see that such a graph is of a type
known as vertex transitive (defined below), and it is obvious that the degree is the
number of elements in $\Omega$. (Biggs [3] provides extensive information about such
graphs.)

⁴ It seems worth remarking here that one could make all these definitions with ordered
pairs of vertices, giving rise to what is known as a directed graph. In networks where the
flow of information is non-symmetric, such a formulation would seem called for.

⁵ The operation in these groups is known as “clock arithmetic” to those upon whom “the
NEW math” was inflicted.

This class of graphs is very large (it is of course infinite, but it is also large in a
more meaningful sense) in that it contains a number of examples of interesting
architectures, and many more can be obtained by simple constructions using these arrays.
As one example, we show in Chapter 4 how to construct a class of networks known as the
Cube-Connected Cycles in the manner described above.
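As a small illustration of the Cayley graph construction, the sketch below builds $\Gamma(G, \Omega)$ for a cyclic group $Z/n$. The function names are hypothetical, and the code only checks that the generator set is closed under inverses; it does not verify that the set generates the group, which is assumed.

```python
def cayley_graph(elements, multiply, generators):
    """Adjacency sets of the Cayley graph: g is joined to w*g for each generator w."""
    return {g: {multiply(w, g) for w in generators} for g in elements}


def cyclic_cayley(n, generators):
    """Cayley graph of Z/n (addition modulo n) for a given generator set."""
    gens = {w % n for w in generators}
    # The definition requires the generating set to be closed under inverses.
    assert all((-w) % n in gens for w in gens), "generator set must equal its inverse"
    return cayley_graph(range(n), lambda a, b: (a + b) % n, gens)


# Generators {1, -1} give a ring; adding the element 4 of Z/8 adds "diameter" chords.
ring = cyclic_cayley(8, {1, -1})
chordal = cyclic_cayley(8, {1, -1, 4})
degrees = {len(neighbors) for neighbors in chordal.values()}
```

Every node of `chordal` has degree 3, the number of elements in the generator set, as the definition predicts.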

Definition 2.3.2: The concept of the symmetry of a graph can be made precise. Let
$\Gamma$ be a graph with nodes labeled $\{1, \ldots, n\}$. Then, a permutation $\tau$ of
the integers $\{1, \ldots, n\}$ is called an automorphism of $\Gamma$ if and only if, for
all $i$ and $j$, $\tau(i)$ is connected, or adjacent, to $\tau(j)$ if, and only if, $i$ is
connected to $j$. Thus, as far as $\Gamma$ is concerned, $i$ and $\tau(i)$ “look” exactly
the same if $\tau$ is an automorphism. We say that $\Gamma$ is node, or vertex,
transitive, provided for all nodes $i$ and $j$ there is an automorphism $\tau$ sending $i$
to $j$.
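For very small graphs, vertex transitivity can be checked by brute force directly from this definition. The sketch below is illustrative only: the function names are hypothetical, nodes are labeled from 0 rather than 1, and the search over all permutations is practical only for a handful of nodes.

```python
from itertools import permutations


def is_automorphism(adj, tau):
    """True if the node relabeling tau preserves adjacency in both directions."""
    nodes = list(adj)
    return all((tau[j] in adj[tau[i]]) == (j in adj[i]) for i in nodes for j in nodes)


def is_vertex_transitive(adj):
    """Brute force over all permutations; only sensible for very small graphs."""
    nodes = sorted(adj)
    autos = [tau for tau in (dict(zip(nodes, image)) for image in permutations(nodes))
             if is_automorphism(adj, tau)]
    return all(any(tau[i] == j for tau in autos) for i in nodes for j in nodes)


def cycle(n):
    """Ring of n nodes."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def path(n):
    """Linear array of n nodes."""
    return {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}


ring_is_transitive = is_vertex_transitive(cycle(5))
path_is_transitive = is_vertex_transitive(path(4))
```

The ring is vertex transitive, while the linear array is not, since no automorphism can send an end node to an interior node.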

2.4  Sorting 

We now relate the idea of data dependence for an algorithm to the communication
constraints of a network by considering methods for sorting which conform to the
communication restrictions imposed by our graph. In general, a sort may be thought of as a
permutation $p$, where the contents of the $i$-th node are sent, ultimately, to the
$p(i)$-th node. (Here we rather naturally suppose that it was desired to get the contents
of the $i$-th node to the $p(i)$-th node. If we imagine the initial contents of the $i$-th
node to be $p(i)$, then our sorting algorithm is in effect computing the inverse of $p$.)
Of course the particular permutation “chosen by the algorithm” will depend on the initial
contents of the various nodes, i.e., the initial state of the array. We shall say that a
sorting algorithm is consistent with a graph $\Gamma$ provided it generates a sequence of
permutations $p_k$ where for each $j$,

    $p(j) = p_k \cdots p_1(j)$    (2.1)

and for each $k$ and $j$, $p_k(j)$ is either equal to $j$ or connected to $j$. (That is,
the permutation $p$ is performed in $k$ steps, and at the $i$-th step the contents of the
$j$-th node are sent to some node $p_i(j)$ which is connected to $j$.) We shall consider
here only consistent algorithms where each of the $p_k$ may be taken to be a transposition
or the identity.

We say that the time complexity of a sequence of permutations, as in Equation 2.1,
is the minimum number of terms into which $p$ can be decomposed, where the terms are
of the form $p_i \cdots p_{i+r}$, and the various $p_l$ in a given term commute. (We call
such a term a time-step. Since distinct transpositions commute if, and only if, they act
on disjoint pairs of nodes, a time-step is some collection of transpositions which can be
performed concurrently.) Finally, the time complexity of an algorithm is the maximum time
complexity of the permutations $p$ generated by the algorithm, where this maximum is taken
over all initial states of the array.
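To make the notions of a consistent algorithm and a time-step concrete, here is a minimal sketch using odd-even transposition sort on a linear array. This is a standard textbook method chosen here for illustration; the report does not say that it is the algorithm the authors studied. Each round exchanges only adjacent nodes, and the exchanges within a round touch disjoint pairs, so each round is one time-step.

```python
def odd_even_transposition_sort(values):
    """Sort on a linear array; return the sorted list and the swaps made in each round."""
    cells = list(values)
    n = len(cells)
    schedule = []
    for step in range(n):
        # Alternate between the pairs (0,1),(2,3),... and (1,2),(3,4),...
        swaps = []
        for i in range(step % 2, n - 1, 2):
            if cells[i] > cells[i + 1]:
                cells[i], cells[i + 1] = cells[i + 1], cells[i]
                swaps.append((i, i + 1))
        schedule.append(swaps)
    return cells, schedule


def is_time_step(swaps):
    """Transpositions can share a time-step only if they touch disjoint nodes."""
    touched = [node for pair in swaps for node in pair]
    return len(touched) == len(set(touched))


result, schedule = odd_even_transposition_sort([5, 2, 4, 1, 3])
```

With N values the sketch always runs N rounds, which matches the diameter bound for a linear array up to a constant.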


One fact which is completely obvious is that, by this definition, the time complexity
of any consistent algorithm is at least as great as the diameter of the underlying graph,
since each time-step moves the value at the $i$-th node over at most one edge in the
graph. Consequently, the sorting problem for a systolic array⁶ with $N$ elements is of
time complexity at least $O(N^{1/2})$. However, the best⁷ results of which we are aware
sort in $O(N)$, and in particular this can be realized for a wide class of arrays.

However, it is not enough to make the diameter small, as consideration of a 2-tree
quickly shows. The difficulty here is that of congestion, i.e., many shortest paths pass
through the same node; every path from the left half of the tree to the right half of the
tree must pass through the root, and so the complexity must be at least $O(N)$ for an
$N$-element tree.

There are, of course, many other considerations in assessing the costs of various
algorithms/arrays for sorting. It seems very desirable for the algorithm, and hence the
array, to have a sufficiently simple regular structure so that a theoretical verification
of the algorithm would be feasible. Also, it seems discordant to the spirit of distributed
processing for the permutations $p_k$ to have general dependence on the current state,
especially when they are as simple as transpositions. In this last case, it seems most
natural to require that $p_k$ depends only on the states of some pair of nodes $i_k$ and
$j_k$, and that the particular nodes depend only on the stage $k$. With these
restrictions, the best sorting algorithms known require⁸ $O(\log^2 N)$ steps. Connections
between sorting and graph theoretic problems will be discussed further in Chapter 4.

2.5  Matrix  Tridiagonalization 

This section contains definitions for concepts needed in Chapter 3. More detailed
information on these topics may be found in the excellent book by Parlett, Reference [4].


Definition 2.5.1: Let $A = (a_{ij})$ be an $n \times n$ real symmetric matrix. We say
that $A$ is tridiagonal if, and only if, $a_{ij} = 0$ for $|i - j| > 1$. Let $R_2(\theta)$
denote the plane rotation

    $R_2(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$.

Then, notice that $R_2(\theta)^{-1} = R_2(-\theta) = R_2(\theta)^{T} = R_2(\theta)^{*}$.
In $n$-space let $R_n(i, j, \theta)$ denote the rotation which is the identity on the
orthogonal complement of the plane given by the $i$, $j$ basis vectors, and is
$R_2(\theta)$ in this plane. Finally, for $i, k \ne j$, let $G(i, j, k)$ denote a
transformation

    $A \to R(i, j, \theta)\, A\, R(i, j, -\theta) = A'$,

such that $A'_{jk} = 0$. It is easy to see that $G(i, j, k)$ is unique up to the sign of
$\theta$ if $i$, $k$ are distinct, and if $k = i$ then $G(i, j, i)$ is given by $\theta$
in a finite set of angles determined by a unique $\phi$.

This transformation is referred to as a Givens rotation. For $A$ Hermitian, $G(i, j, k)$
is defined in a completely analogous fashion, and similar types of simple ambiguities in
its definition exist, although $\theta$ may now be complex. See Reference [4], especially
Chapter 6, Section 4, for a more detailed exposition.

⁶ See Reference [1] for a definition of systolic array.

⁷ See Knuth [2], Section 5.3.4.

⁸ Again, see [2].
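The identities for the plane rotation, and the choice of angle that annihilates one component of a pair, can be checked numerically. The following is a small pure-Python sketch; the helper names (`rotation2`, `zeroing_angle`, and so on) are illustrative and not part of the report's notation. The angle formula follows from requiring that the second component of $R_2(\theta)(x, y)^T$, namely $x \sin\theta + y \cos\theta$, be zero.

```python
import math


def rotation2(theta):
    """The 2 x 2 plane rotation R_2(theta)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def zeroing_angle(x, y):
    """Angle theta for which R_2(theta) maps the column (x, y) to (r, 0)."""
    return -math.atan2(y, x)


def close(a, b, tol=1e-12):
    """Entrywise comparison of two matrices of the same shape."""
    return all(abs(p - q) <= tol for ra, rb in zip(a, b) for p, q in zip(ra, rb))


theta = 0.7
inverse_is_transpose = close(matmul(rotation2(theta), transpose(rotation2(theta))),
                             [[1.0, 0.0], [0.0, 1.0]])
inverse_is_negated_angle = close(transpose(rotation2(theta)), rotation2(-theta))
rotated = matmul(rotation2(zeroing_angle(3.0, 4.0)), [[3.0], [4.0]])
```

Applied to rows $i$ and $j$ of a column $k \ne i, j$, the same angle is what makes the entry $A'_{jk}$ vanish, since the right-hand factor of the transformation leaves column $k$ unchanged.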
3.  Matrix  Tridiagonalization


3.1  Introduction 

The increasing availability of VLSI (Very Large Scale Integration) devices and
specialized computer architectures has led to a surge of interest in algorithms which
utilize parallel processing. A large number of these algorithms have been directed towards
solving classical problems in linear algebra, with notable successes in solving linear
equations and performing matrix operations (References [1]-[3]). However, the computation
of the eigenvalues and eigenvectors of symmetric (Hermitian) matrices remains an area in
which results have been somewhat less effective than one might hope. The standard
approach, which also lends itself to parallel architectures, is tridiagonalization
followed by the QR algorithm ([1]-[7]). The major stumbling block in a fast implementation
of such procedures is the tridiagonalization, which, under current methods, requires a
time of $O(N^2)$ to reduce an $N \times N$ matrix, in sharp contrast to $O(\log N)$⁹ for
each iteration of the QR algorithm ([5]-[7]).

Thus, it is of interest to establish bounds on the ability of parallel processing to
speed the reduction of a matrix to tridiagonal form. Unfortunately, the analysis given
here is of a very restricted nature, as it assumes that the algorithm doesn’t make
non-zero any entries which have been previously set to zero. However, while this
requirement may seem absurd, many of the proposed methods which were circulating
informally at the time this work was being done satisfied this condition. Therefore, while
this analysis offers little for the general case, it at least explains why those proposed
methods had such poor performance, and why the current systolic methods which work in
$O(N^2)$ time are of such different character. (See Reference [8].)


3.2  Problem  Statement 

Let $A$ denote an $N$-dimensional symmetric (Hermitian) matrix. A Givens rotation
$G(i_1, i_2, j)$ is the essentially unique rotation in the plane of the $i_1$ and $i_2$
coordinate axes which results in $A'_{i_2 j} = 0$ ([4]). Note that such a transformation
is represented by an orthogonal (unitary) matrix $R$ where $A \to A' = R A R^{-1}$. The
effects of a Givens rotation in creating or destroying matrix zeros may be characterized
as follows (see Figure 3.1):

a.  Only rows $i_1$ and $i_2$ and columns $i_1$ and $i_2$ are affected.

b.  A zero is produced at element $A_{i_2 j}$ and symmetrically at $A_{j i_2}$.

c.  For any $k \ne i_1, i_2$, if $A_{i_1 k} = A_{i_2 k} = 0$ prior to rotation, then they
    remain zero after rotation. (By symmetry this also holds for
    $A_{k i_1} = A_{k i_2} = 0$.)

Thus, with the exception of cases b and c, all elements of rows and columns $i_1$ and
$i_2$ in the resulting matrix will be generically non-zero.

⁹ $O(N)$ for systolic arrays.
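Properties (a) through (c) can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: it uses 0-based indices, a real symmetric test matrix invented for the purpose, and a hypothetical helper `givens` that forms $R A R^{T}$ explicitly rather than updating only the affected rows and columns.

```python
import math


def givens(a, i1, i2, j):
    """Return R A R^T, where R rotates the (i1, i2) plane so that entry (i2, j) vanishes.

    Indices are 0-based here, and j must differ from i1 and i2.
    """
    n = len(a)
    theta = -math.atan2(a[i2][j], a[i1][j])
    c, s = math.cos(theta), math.sin(theta)
    r = [[float(p == q) for q in range(n)] for p in range(n)]
    r[i1][i1], r[i1][i2], r[i2][i1], r[i2][i2] = c, -s, s, c
    ra = [[sum(r[p][k] * a[k][q] for k in range(n)) for q in range(n)] for p in range(n)]
    return [[sum(ra[p][k] * r[q][k] for k in range(n)) for q in range(n)] for p in range(n)]


# Invented symmetric test matrix with zeros at (3,0) and (4,0), and symmetrically.
n = 5
a = [[1.0 / (1 + p + q) for q in range(n)] for p in range(n)]
for row in (3, 4):
    a[row][0] = a[0][row] = 0.0

b = givens(a, 3, 4, 1)   # rotate rows/columns 3 and 4 to annihilate entry (4, 1)
```

In this example the entries of `b` outside rows and columns 3 and 4 equal those of `a`, the entry (4, 1) is zero, and the pre-existing zeros in column 0 survive.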


    The matrix A                  The matrix A'

    .  .  .  0  .  0              .  .  .  0  .  0
    .  .  .  X  .  X              .  .  .  0  .  X
    .  .  .  X  .  X              .  .  .  X  .  X
    0  X  X  X  X  X              0  0  X  X  X  X
    .  .  .  X  .  X              .  .  .  X  .  X
    0  X  X  X  X  X              0  X  X  X  X  X


Figure 3.1  The Givens rotation G(6,4,2). Elements represented by dots are unaffected;
those represented by Xs are changed, and the 0s behave as shown.


Next consider the problem of reducing a symmetric matrix $A$ to tridiagonal form
using Givens rotations. We also wish to make use of parallel processing, so we allow
several rotations to be performed simultaneously, provided they involve disjoint sets¹⁰ of
rows. In that case, the corresponding matrices $R$ (equivalently the angles of rotation)
will be independent of each other, and no element of $A$ will be affected by more than two
Givens rotations. It then follows that with sufficiently many processors such a set of
rotations may be performed in $O(1)$ time. Our principal result is the following theorem:

Theorem 1: Suppose we are given an algorithm for tridiagonalizing an $N \times N$
symmetric (Hermitian) matrix by, possibly concurrent, Givens rotations as described above.
If the algorithm never replaces a generic zero by a generic non-zero, then it requires at
least $O(N \log N)$ time steps. Furthermore, this lower bound is realizable.

The  proof  of  Theorem  1  will  be  divided  between  the  next  two  sections. 

3.3  An  O(N log N)  Algorithm

In the interest of simplicity, in Sections 3.3 and 3.4, our discussion is restricted to a
description of the elements below the diagonal. Since the matrix is symmetric (Hermitian),
there is essentially no loss of generality. The stipulated algorithm proceeds by
zeroing one column at a time, starting with $j = 1$ and following with $j = 2$, etc. It is
clear from property (c) that if all sub-tridiagonal elements $A_{ik}$ are zero for
$k < j$, they will remain zero under Givens rotations of the form $G(i_1, i_2, j)$ where
$i_1, i_2 \ge j+1$.

¹⁰ More formally, a collection of rotations $\{G(i_1, i_2, j), G(i_1', i_2', j'), \ldots\}$
is permitted to be performed in parallel provided the sets $\{i_1, i_2\}$,
$\{i_1', i_2'\}$, ... are disjoint.
We next show that column $j$ may be zeroed in $O(\log_2(N - j - 1))$ steps. By pairing
rows $j+1$ through $N$ we may simultaneously zero half the sub-tridiagonal elements of
column $j$ (rows $j+2$ to $N$). We then repeat the process for the remaining half of the
rows, which still have non-zero elements in column $j$. Continuing in this fashion, we
find that all elements of column $j$ (i.e., $A_{ij}$ for $i \ge j+2$) have been reduced in
$g$ steps where

    $N - j - 2 \le \frac{N-j-1}{2} + \frac{N-j-1}{4} + \cdots + \frac{N-j-1}{2^g}$,

that is,

    $1 - \frac{1}{N - j - 1} \le 1 - \frac{1}{2^g}$.

Thus, it is sufficient that

    $g \ge \log_2(N - j - 1)$.    (3.1)

Finally, summing over $j$ and noting that $g$ must be an integer, we find that our
algorithm takes

    $\sum_{j=1}^{N-2} \left( \log_2(N - j - 1) + 1 \right) = \log_2 (N-2)! + N - 2$    (3.2)

steps which, by Stirling’s formula, is $O(N \log_2 N)$. This proves existence.
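The pairing argument can be turned into an explicit schedule and counted. The sketch below is one possible reading of the procedure, with assumptions stated plainly: rows are 1-based, the rows still holding a non-zero entry in column $j$ are paired with their neighbours in that list, and the first row of each pair is the one that keeps its entry. The function names are hypothetical.

```python
import math


def column_schedule(n, j):
    """Time-steps of rotations (i1, i2, j) that zero rows j+2..n of column j (1-based)."""
    live = list(range(j + 1, n + 1))   # rows whose entry in column j is still non-zero
    steps = []
    while len(live) > 1:
        # Pair neighbours in the live list; the second row of each pair is zeroed.
        pairs = [(live[k], live[k + 1], j) for k in range(0, len(live) - 1, 2)]
        steps.append(pairs)
        live = live[0::2]
    return steps


def total_steps(n):
    """Total number of time-steps over columns 1..n-2."""
    return sum(len(column_schedule(n, j)) for j in range(1, n - 1))


def bound(n):
    """The closed form log2((N-2)!) + N - 2 from Equation 3.2."""
    return math.log2(math.factorial(n - 2)) + n - 2


steps_for_8 = total_steps(8)
```

Within each time-step the row pairs are disjoint, so the rotations may be performed concurrently, and the total never exceeds the closed form of Equation 3.2.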

We remark that such an algorithm requires communication across the entire matrix.
If we restrict ourselves to “local” connections as in systolic arrays, it is not generally
possible to achieve the same speed. Pipelining still enables us to perform Givens
rotations in $O(1)$ time (References [3] and [6]). However, it is not difficult to see
that if only adjacent matrix elements may communicate (i.e., only rotations of the form
$G(i_1, i_1 \pm 1, j)$ are allowed), there is essentially only one algorithm for reducing
a column. It must start at the bottom, and proceed up, ending at $G(j+1, j+2, j)$. We then
find, using Lemma 1 of Section 3.4, that for algorithms with “local communication” the
best that can be done is $O(N^2)$ time steps. (This statement should not be taken too
rigorously since we have not really defined the term “local”.)
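As a rough check of the $O(N^2)$ figure, assume (for illustration only, since the text does not define “local” precisely) that each time step performs a single adjacent rotation in the current column, and that columns are reduced one after another as Lemma 1 requires. Column $j$ then needs one rotation for each of its $N - j - 1$ sub-tridiagonal elements, and the total count is

```latex
\sum_{j=1}^{N-2} (N - j - 1) \;=\; \sum_{m=1}^{N-2} m \;=\; \frac{(N-1)(N-2)}{2} \;=\; O(N^2).
```

Under the same assumption, the corresponding count for the pairing algorithm of this section is the $O(N \log_2 N)$ of Equation 3.2.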

            a.                            b.

    X                             X
    X  X                          X  X
    0  X  X                       0  X  X
    0  0  X  X                    X  0  X  X
    0  0  0  X  X                 0  0  0  X  X
    0  0  0  0  X  X              0  0  0  0  X  X

    The final state               A hypothesized
    of matrix A.                  previous state

Figure 3.2
3.4  Derivation  of  a  Lower  Bound


To  complete  the  proof  of  Theorem  1,  we  establish  the  following  lemma: 

Lemma 1: An algorithm of the above type (tridiagonalization by Givens rotations)
must proceed by the successive annihilation of columns. More precisely, if the algorithm
is to take a minimal number of time steps, and not make generic zeros non-zero, column
$j$ must be placed in tridiagonal form prior to column $j+1$.

Of course, one can zero elements in column $l > j$ prior to $j$, but this lemma states
that one will eventually have to make non-zero some element of column $l$.

We first note that, given any algorithm which employs parallel Givens rotations, we
may assume the existence of an equivalent algorithm with Givens rotations in sequence.
(Simply order the concurrent rotations performed at each step in an arbitrary manner.)
The proof, then, is obtained by induction, starting from the tridiagonalized matrix and
working backwards. To motivate this method let us consider the tridiagonal matrix
pictured in Figure 3.2a. The final zero placed by the algorithm could only be element
$A_{5,3}$. The creation of any other zero would also have created a non-zero element. For
example, zeroing element $A_{4,1}$ of Figure 3.2b by a Givens rotation with row 5,
$G(5,4,1)$, creates a non-zero element at $A_{5,1}$. Alternatively, the use of row 3
destroys (by its action on column 3) $A_{5,3}$. Similarly, the use of row 2 destroys
$A_{4,2}$ and $A_{5,2}$, and row 1 would destroy $A_{3,1}$ and $A_{5,1}$. In other words,
we conclude that the last stage of the algorithm must have consisted of the single Givens
rotation $G(4,5,3)$ using row 4 to zero the element $A_{5,3}$. We now proceed to the
general proof.
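
The fill-in described above can be checked numerically. The following is a minimal sketch, not the report's own code: the helper names `givens_similarity` and `tridiagonal` and the test matrix values are assumptions made for illustration. It applies one rotation as a similarity transformation on a 5 x 5 symmetric matrix and shows that using row 5 to zero $A_{4,1}$ moves the non-zero to $A_{5,1}$.

```python
import math

def givens_similarity(a, p, q, col):
    """Return G a G^T, where G rotates rows p and q so that entry (q, col) becomes zero."""
    n = len(a)
    r = math.hypot(a[p][col], a[q][col])
    c, s = a[p][col] / r, a[q][col] / r
    # Row operation: rows p and q are replaced by rotated combinations.
    b = [row[:] for row in a]
    for k in range(n):
        b[p][k] = c * a[p][k] + s * a[q][k]
        b[q][k] = -s * a[p][k] + c * a[q][k]
    # Column operation (the transpose acting on the right) keeps the matrix symmetric.
    out = [row[:] for row in b]
    for k in range(n):
        out[k][p] = c * b[k][p] + s * b[k][q]
        out[k][q] = -s * b[k][p] + c * b[k][q]
    return out

def tridiagonal(n):
    """A symmetric tridiagonal test matrix with arbitrary (hypothetical) non-zero values."""
    return [[float(1 + i + j) if abs(i - j) <= 1 else 0.0 for j in range(n)]
            for i in range(n)]

a = tridiagonal(5)
a[3][0] = a[0][3] = 2.0                      # extra non-zero A_{4,1} (1-based indices)
b = givens_similarity(a, p=4, q=3, col=0)    # G(5,4,1): use row 5 to zero A_{4,1}
```

Indices in the code are 0-based, so `b[3][0]` is $A_{4,1}$ and `b[4][0]` is $A_{5,1}$ after the rotation.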

Pf. of Lemma 1: Suppose that at some stage $t$ of the algorithm we have the following
situation below the diagonal (where $j < N-2$): columns 1 to $j$ have their final
tridiagonal structure; column $j+1$ has at least one “generic” zero; and columns greater
than $j+1$ are arbitrary. (This situation is pictured in Figure 3.3 for $j = 2$ and
$N = 8$.) Then the previous stage $t-1$ could not have had a non-zero off-tridiagonal
element in column $j$, say $A_{l,j}$ with $l \geq j+2$, because zeroing that element by a
Givens rotation would have involved either

a. The interaction of row $l$ with another row $r \geq j+2$, which would create a
non-zero entry at $A_{r,j}$, destroying the tridiagonal structure for columns 1 to $j$;

or

b. The interaction of row $l$ with a different row $r < j+2$, which implies the
interaction of column $l$ with column $r$. This interaction would create a non-zero
entry in column $r$ and either destroy the tridiagonal structure (if $r < j+1$) or
remove a zero from column $j+1$ (if $r = j+1$).

These aspects are illustrated in Figure 3.3 for $l = 6$. It now follows by induction
on $j$ for $j = N-1, N-2, \ldots, 1$ that column $j-1$ must have been completely reduced
prior to initiating the final reduction of column $j$. Thus, an algorithm of the type
specified must proceed column by column; the reduction of any matrix elements
outside such a sequence results in the creation of new generic non-zeros.

To complete the proof of Theorem 1, we note that in reducing column $j$ we
may only use rows $j+1$ through $N$ (otherwise we introduce non-zeros in the column
through mechanism (b)). Similarly, we may not use a row $r$ with a zero in the $j$-th
column to reduce some other row, since the zero $A_{r,j}$ will be destroyed. Finally, the
condition that concurrent Givens rotations be performed on disjoint pairs of rows
implies that at most one-half the non-zero entries may be reduced in one stage of
the algorithm. This restricts us to the situation of Equation (3.1); i.e.,
$O(\log(N-j-1))$ steps are necessary to reduce column $j$. Equation (3.2) then implies
that the entire tridiagonalization takes at least $O(N \log N)$ time steps.
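
The halving argument can be illustrated with a small counting sketch. It rests on simplifying assumptions that are not taken from the report: column $j$ is assumed to start with $N-j$ non-zero entries below the diagonal, exactly one of which (the subdiagonal entry) must survive, and the side effects of the rotations on other columns are ignored. The function names are hypothetical.

```python
import math

def stages_for_column(nonzeros):
    """Stages needed to cut `nonzeros` entries down to one using disjoint pairings."""
    stages = 0
    while nonzeros > 1:
        nonzeros -= nonzeros // 2   # each disjoint pair removes one non-zero
        stages += 1
    return stages

def total_stages(n):
    """Sum of the per-column stage counts for columns j = 1, ..., n-2 (1-based)."""
    return sum(stages_for_column(n - j) for j in range(1, n - 1))

def per_column_bound(n, j):
    """The summand log2(N-j-1) + 1 appearing in Equation (3.2)."""
    return math.log2(n - j - 1) + 1
```

For example, `total_stages(8)` counts 3 + 3 + 3 + 2 + 2 + 1 = 14 stages, and each per-column count stays within the corresponding summand of Equation (3.2).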

Figure 3.3
a. A possible state of the matrix at stage $t$ of the algorithm. This corresponds to
the case $j = 2$ of the text.
b. A hypothetical situation one Givens rotation short of stage $t$, assuming stage $t$
created a zero at $A_{6,2}$. This is impossible as: (i) the interaction of row 6 with any
row $\geq j+2$ creates a new x in column 2, and (ii) the interaction of row 6 with row
$r < j+2$ implies the interaction of column 6 with column $r$, which yields a new x in
column $r$.


3.5  Remarks 

By restricting our arguments to elements below the diagonal, we have implicitly
assumed that $i, k > j$ in $G(i,k,j)$. If this restriction is relaxed, we find that the
last Givens rotation may have zeroed either $A_{3,1}$ (a result, by symmetry, of applying
$G(2,1,3)$) or $A_{N,N-2}$. As a consequence, the reduction may actually proceed by
simultaneously reducing rows (bottom to top) and columns (left to right). The number of
time steps still remains $O(N \log N)$, however.

The astute reader may also have noted that our arguments assume the elements of
the two subdiagonals are generically non-zero, and may justifiably wonder whether some
appropriate zeroing of these elements, or others, at intermediate stages could have a
positive effect on the algorithm. That this is so has been shown in Reference [8], by
R. Schreiber.

4. Regular and Cayley Graphs


4.1  Introduction 

It is clear that communication time between processors is one of the most severe
limiting factors in designing high speed parallel computers. Under these circumstances,
it is obviously important to be able to design networks of processors in which
communication time is as short as possible. A mathematical version of this design
problem is the problem of constructing graphs of given fixed degree and number of
vertices with small diameter. (See Chapter 2 for definitions of these terms.) We have
attacked this problem from two directions. First, we have constructed a heuristic
algorithm which finds graphs with small diameter, and implemented it on a computer. The
program is written in the C language. We have compared the results with the previous
best known results. Secondly, we have studied a collection of graphs which have compact
and systematic descriptions, the so-called Cayley graphs. Routing algorithms for these
graphs are easy to specify. We have given a crude comparison of their diameter with that
of a theoretical bound, and studied a specific class of them, the “modified
cube-connected cycles.”


4.2  Summary  of  Results 

The  results  obtained  may  be  summarized  as  follows. 

a. The heuristic algorithm we constructed is an improvement over all previous
algorithms of its kind. In particular, we improved many of the best known values for
dense graphs of given degree and diameter. A complete description is given in
Section 4.3.2, where Figure 4.1 shows all the improved values we obtained.

b. Our algorithm does not find the densest known graphs in cases where they are
constructed using systematic combinatorial constructions. This suggests that one
should study a restricted class of graphs that has systematic descriptions.

c. The symmetric groups $S_n$ admit graph structures whose diameters approach the
theoretical bound (Moore bound) arbitrarily well as $n \to \infty$. This suggests that
they should be studied much more carefully as a possible source of efficient
communications networks.

d. A slight modification of the cube-connected cycles of Reference [1] produces an
infinite family of graphs whose diameter grows as $2\log_2(K)$, where $K$ is the number
of points in the graph. This compares favorably with $\frac{5}{2}\log_2(K)$, which is the
diameter of the cube-connected cycles.

e. A layout is given for these modified cube-connected cycles, whose area is 3/2
times the area of the cube-connected cycles with the same number of nodes. The
VLSI measure of complexity, $AT^2$, is thus slightly decreased, by a factor of 24/25.

These results are substantiated by running the heuristic program which we have
designed, and by the theoretical analysis contained in Section 4.3.


4.3  Detailed  Description  of  Results 
4.3.1  Measures  of  Communication  Time. 

Throughout, we are interested in arrays of “processors” connected by “wires.” The
nature of these processors is not specified, because we want to study the general
problem of communication time, without restricting to a specific situation. We formalize
this notion by considering graphs $\Gamma$, where the vertices of the graphs correspond to
processors and edges to wires. We will suppose that the array functions in such a way
that at every time step information is allowed to flow along one wire. The time required
to move along one wire is presumed to be constant. By the distance between two
processors, or the corresponding vertices $x$ and $y$, we mean the length of the shortest
path in $\Gamma$ from $x$ to $y$. (These and other standard terms from graph theory are
defined in Chapter 2.) From this point on, we no longer speak of the arrays of processors
but only of their corresponding graphs. We wish to study the problem of designing graphs
with a given number of vertices, having small diameter. Of course, with no constraints on
the graph, this is a trivial problem since complete graphs all have diameter 1. However,
technology dictates that the number of edges from each vertex should be less than or
equal to some finite number $d$. Accordingly, we consider only regular graphs, and we
attempt to solve the problem of minimizing the diameter of regular graphs of degree $d$
having, say, $N$ vertices. There is an a priori upper bound to $N$, given $d$, called the
Moore bound (see Reference [2]), which is

$$N \leq 1 + d + d(d-1) + \cdots + d(d-1)^{k-1}$$

where $k$ is the diameter of the graph. This says that asymptotically, $k$ grows at least
as fast as

$$\log_{d-1}(N) = \frac{\ln N}{\ln (d-1)}.$$

It is known, however, that this bound is only sharp in a finite number of cases for
$d \geq 3$ (for $d = 2$, the cyclic graphs are all examples where it is sharp), and it
seems generally to be rather crude. Our efforts toward studying this problem consist of
the construction of a heuristic algorithm (see Section 4.3.2), and some specific
constructions derived from group theory (see Section 4.3.4).
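
As a small illustration of the Moore bound, the following sketch evaluates the bound and the least diameter it permits for a given degree and number of vertices. The function names are hypothetical, and the degree is assumed to be at least 2 so that the search terminates.

```python
def moore_bound(d, k):
    """Upper bound 1 + d + d(d-1) + ... + d(d-1)^(k-1) on the number of vertices."""
    return 1 + sum(d * (d - 1) ** i for i in range(k))

def min_diameter(d, n):
    """Smallest k whose Moore bound admits n vertices (assumes d >= 2)."""
    k = 0
    while moore_bound(d, k) < n:
        k += 1
    return k
```

For instance, `moore_bound(3, 2)` is 10, so a trivalent graph on 11 or more vertices must have diameter at least 3.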

The diameter itself is only a weak measure of the effectiveness of the processor
array. Suppose, for instance, that the processors have no memory, and that one wishes to
transfer information simultaneously from the vertex $v_i$ to the vertex $w_i$, for
$i = 1, \ldots, k$. Since there is no memory, one must produce a collection of paths
$p^{(1)}, \ldots, p^{(k)}$ of length $n$ in $\Gamma$, with $p^{(i)}_s \neq p^{(j)}_s$ for
all $1 \leq i, j \leq k$ with $i \neq j$, and so that $p^{(i)}_0 = v_i$ and
$p^{(i)}_n = w_i$. Here, $p^{(i)}_s$ denotes the $s$-th vertex of the path $p^{(i)}$. Such
a collection of paths is called a $k$-separated multipath in $\Gamma$ from
$(v_1, \ldots, v_k)$ to $(w_1, \ldots, w_k)$. We define the $k$-separated distance
between $(v_1, \ldots, v_k)$ and $(w_1, \ldots, w_k)$ to be the minimum length of all
$k$-separated multipaths from $(v_1, \ldots, v_k)$ to $(w_1, \ldots, w_k)$ in $\Gamma$,
and denote it by

$$d_k((v_1, \ldots, v_k), (w_1, \ldots, w_k)).$$

The $k$-separated diameter is

$$\Delta_k(\Gamma) = \max\, d_k((v_1, \ldots, v_k), (w_1, \ldots, w_k)),$$

where the max is taken over all pairs of $k$-tuples, with $v_i \neq v_j$ and
$w_i \neq w_j$ unless $i = j$.

The $k$-separated diameter $\Delta_k(\Gamma)$ takes congestion into account, and so is
a more sensitive measure of effectiveness of the processor array. However, it assumes
that the processors have no memory. We will define another measure of efficiency which
assumes that each processor has $l$ units of memory. (Notice that this is not equivalent
to having memory, but is simply a measure which attempts to take into account some
possible improvements in algorithms which would be made possible by having memory.) By
a $(k,l)$-separated multipath of length $n$ from $(v_1, \ldots, v_k)$ to
$(w_1, \ldots, w_k)$, we mean a $k$-tuple of paths $(p^{(1)}, \ldots, p^{(k)})$ of length
$n$, so that

a. $p^{(i)}$ is a path from $v_i$ to $w_i$,

and

b. Each of the $k$-tuples $(p^{(1)}_s, \ldots, p^{(k)}_s)$ contains each vertex at most
$l$ times.

The corresponding diameter, defined as before, is denoted $\Delta_k(\Gamma, l)$. Note
that

$$\Delta_k(\Gamma, 1) = \Delta_k(\Gamma),$$

and that for $l \geq k$, $\Delta_k(\Gamma, l) = \Delta(\Gamma)$. Intuitively,
$\Delta_k(\Gamma, l)$ measures the time required to transfer the information in any
$k$-tuple of processors to any other $k$-tuple of processors, given that each processor
has $l$ units of memory.

We now observe that for any graph $\Gamma$, $\Delta_k(\Gamma, l)$ is just the diameter
of a graph associated with $\Gamma$. For, form the $k$-fold product graph

$$\underbrace{\Gamma \times \cdots \times \Gamma}_{k}.$$

In $\Gamma \times \cdots \times \Gamma$, consider the full subgraph $\Gamma_k^l$ on the
set $V_k^l \subset V \times \cdots \times V$ (here $V$ denotes the vertex set of
$\Gamma$), where

$$V_k^l = \{ (v_1, \ldots, v_k) \mid \text{no } v_j \text{ appears more than } l \text{ times} \}.$$

Then it is easy to see that

$$\Delta_k(\Gamma, l) = \Delta(\Gamma_k^l).$$

Thus, we have reduced these more sophisticated measures of effectiveness to a
diameter question; this will be a useful reduction in view of the algorithm to be
considered in the next section.
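
For very small graphs, this reduction can be tried directly by breadth-first search over the allowed $k$-tuples. The sketch below is only illustrative and makes an assumption the text does not spell out: in one time step each message may either stay at its vertex or cross one wire. The function name `separated_diameter` and the example graph `path3` are hypothetical.

```python
from collections import Counter, deque
from itertools import product

def separated_diameter(adj, k, l):
    """Diameter of the product subgraph on k-tuples with no vertex repeated more than l times."""
    def allowed(state):
        return max(Counter(state).values()) <= l

    states = [s for s in product(range(len(adj)), repeat=k) if allowed(s)]
    worst = 0
    for start in states:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            # Assumption: each message either stays put or crosses one wire per step.
            moves = [[v] + list(adj[v]) for v in cur]
            for nxt in product(*moves):
                if nxt not in dist and allowed(nxt):
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        if len(dist) < len(states):
            return None  # some tuple is unreachable under these rules
        worst = max(worst, max(dist.values()))
    return worst

path3 = [[1], [0, 2], [1]]   # the path 0 - 1 - 2 as adjacency lists
```

On the three-vertex path, two messages that must exchange ends cannot both pass through the middle vertex at the same time when $l = 1$, so the separated diameter exceeds the ordinary diameter; with $l = 2$ the two agree.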

Unfortunately the diameter is often rather expensive to compute. Consequently,
one would like to obtain some less sensitive, more easily computable invariants
of graphs which still have some relation to the diameter.

Definitions: A cycle is a path from some given node to itself. The girth of a graph
$\Gamma$ is the length of the shortest cycle of $\Gamma$ which contains no repeated edges.

We note that if a graph has diameter $k$, then its girth is at most $2k+1$. Intuitively,
the girth is inversely related to the diameter. For a fixed number of points, small
diameter tends to imply large girth. Let us summarize the facts known about girth
relating to this problem.

a. There is a lower bound for the number of points in a graph with a given girth,
analogous to the Moore bound (see Reference [3]).

b. For a graph of odd girth $g = 2k+1$ and degree $d$, the number of points is at least

$$1 + d + d(d-1) + \cdots + d(d-1)^{k-1}.$$

This bound is attained only for $d = 2$, or for $d = 3, 7$, and possibly 57, with $g = 5$.

c. For a graph of even girth $g = 2k$ and degree $d$, the lower bound is

$$1 + d + d(d-1) + \cdots + d(d-1)^{k-2} + (d-1)^{k-1}.$$

This bound is known to be attainable only for $g = 4, 6, 8$, or 12. Also, only degrees of
the form $p^m + 1$ are known to occur, where $p$ is a prime. In each case the diameter is
$k$. These graphs are by far the best graphs for the diameter problem with given
diameter and degree.
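
The girth of a small graph can be computed by breadth-first search from every vertex. The sketch below is a standard technique rather than anything taken from the report; it assumes a simple undirected graph given as adjacency lists, and the names `girth` and `petersen` are chosen for illustration. The Petersen graph (degree 3, 10 vertices, diameter 2) serves as an example of a graph meeting the odd-girth bound with $g = 5$.

```python
from collections import deque

def girth(adj):
    """Length of the shortest cycle, or None for a forest (BFS from every vertex)."""
    best = None
    for root in range(len(adj)):
        dist = {root: 0}
        parent = {root: None}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    parent[w] = u
                    queue.append(w)
                elif parent[u] != w:
                    # A non-tree edge closes a cycle through the BFS tree.
                    length = dist[u] + dist[w] + 1
                    if best is None or length < best:
                        best = length
    return best

# Petersen graph: outer 5-cycle, spokes, and inner pentagram.
petersen = [[] for _ in range(10)]
for i in range(5):
    for a, b in ((i, (i + 1) % 5), (i, i + 5), (i + 5, (i + 2) % 5 + 5)):
        petersen[a].append(b)
        petersen[b].append(a)
```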

We define certain numbers related to the girth. Given a vertex $x$, we define $N_k(x)$
to be the number of vertices connected to $x$ by a path of length at most $k$, and define
$\nu_k(\Gamma)$ to be

$$\min_{x \in \Gamma} N_k(x).$$

Note that if $\Gamma$ is a regular graph of degree $d$, then $\nu_k(\Gamma)$ is bounded
above by $1 + d + d(d-1) + \cdots + d(d-1)^{k-1}$, and that the girth of $\Gamma$ is
$\geq l$ if $\nu_k(\Gamma)$ attains this bound for all $k < l/2$. Consequently, to
maximize girth, one should attempt to maximize successively
$\nu_1(\Gamma), \nu_2(\Gamma), \ldots, \nu_k(\Gamma), \ldots$. In each case,
$\nu_k(\Gamma)$ should be maximized subject to the constraint that one remains at an
optimum for the previous values of the subscript. $\nu_k(\Gamma)$ is quickly computed for
small values of $k$, and the $\nu_k$ are another more computable collection of
invariants of graphs, which are related to the efficiency of the associated array of
processors. This is particularly useful in attempting to work with the measures
$\Delta_k(\Gamma, l)$; since the graphs $\Gamma_k^l$ are usually quite large, the time
spent computing invariants is of primary importance.

4.3.2  A  Heuristic  Algorithm. 

A  “hill-climbing”  algorithm  is  produced  here,  using  a  particular  heuristic  criterion 
to  find  graphs  of  a  given  fixed  degree  and  number  of  points  with  small  diameter. 

Let $\Gamma$ be a graph of degree $d$, and let $v_1, v_2, w_1, w_2$ be vertices of
$\Gamma$ so that $v_1v_2$ and $w_1w_2$ are edges of $\Gamma$. Then by the perturbed graph
based on $(v_1, v_2, w_1, w_2)$, we mean the graph $\Gamma'$ whose vertices are the same as
those of $\Gamma$, and whose edge set is
$(E_\Gamma - \{v_1v_2, w_1w_2\}) \cup \{v_1w_1, v_2w_2\}$. Here $E_\Gamma$ is the edge set
of $\Gamma$. We say also that $\Gamma'$ is the result of a perturbation on $\Gamma$. These
modifications are precisely the X-changes defined in Reference [4]. $\Gamma'$ is regular
of degree $d$ if $\Gamma$ is. We will view these perturbations as “small” changes in the
graph, and move in directions which improve a certain functional which we define below.
Graphs will be encoded by their “incidence matrices.” We number the vertices of the graph
$\Gamma$ by $\{v_1, \ldots, v_N\}$. By the incidence matrix $I(\Gamma)$ we mean the matrix
$(a_{ij})$, where $a_{ij} = 1$ if $v_iv_j$ is an edge of $\Gamma$ or $i = j$, and
$a_{ij} = 0$ otherwise. One useful property of $I(\Gamma)$ is:

The $(i,j)$-th entry of $I(\Gamma)^k$ counts the paths of length $k$ from $v_i$ to $v_j$
in $\Gamma$ when remaining at a vertex is allowed as a step; in particular, it is non-zero
exactly when there is a path of length $\leq k$ from $v_i$ to $v_j$.

Consequently, the diameter of $\Gamma$ is the least value of $k$ for which all the
entries of $I(\Gamma)^k$ are non-zero. This criterion is used in the algorithm to compute
the diameter, since matrix powers are readily computable by a machine.
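
The matrix-power criterion can be sketched in a few lines. This is an illustrative reimplementation in Python, not the report's C program; the function names are hypothetical, and vertices are numbered from 0.

```python
def incidence_matrix(n, edges):
    """Adjacency matrix with ones on the diagonal, as in the definition of I(Gamma)."""
    m = [[int(i == j) for j in range(n)] for i in range(n)]
    for i, j in edges:
        m[i][j] = m[j][i] = 1
    return m

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def diameter_by_powers(m):
    """Least k such that m^k has no zero entry, or None if the graph is disconnected."""
    n = len(m)
    power = [[int(i == j) for j in range(n)] for i in range(n)]  # m^0
    for k in range(n):
        if all(all(row) for row in power):
            return k
        power = mat_mul(power, m)
    return None
```

For a 6-cycle, `diameter_by_powers(incidence_matrix(6, [(i, (i + 1) % 6) for i in range(6)]))` returns 3, since antipodal vertices are first joined in the third power.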

The diameter alone is itself not a sufficiently sensitive invariant for purposes of the
algorithm. Specifically, there are too many graphs for which no perturbation results in
an improvement of the diameter. Consequently, using only the diameter as a functional
to be optimized, the algorithm is frequently unable to find a graph with even reasonable
diameter. A definition is needed to improve matters. Let $Z$ denote the integers. We
wish to define an ordering on $Z^n$, called the lexicographic ordering. For $n = 1$, the
lexicographic ordering on $Z^1 = Z$ is just the usual ordering on the integers. For
$n > 1$, we suppose the ordering is already defined for all $m < n$. We write
$Z^n = Z \times Z^{n-1}$, and define the ordering inductively by

$$(a, x) < (a', x') \iff a < a', \text{ or } a = a' \text{ and } x < x',$$

for $a, a' \in Z$ and $x, x' \in Z^{n-1}$. Now, we associate to every graph its “diameter
vector.” First, for a positive integer $t$, we define $\alpha_t(\Gamma)$ to be the number
of zeros in the $t$-th power of $I(\Gamma)$. Of course, if $t \geq \Delta(\Gamma)$,
$\alpha_t(\Gamma) = 0$. The diameter vector is now simply the vector

$$(\Delta(\Gamma), \alpha_{\Delta-1}(\Gamma), \alpha_{\Delta-2}(\Gamma), \ldots, \alpha_2(\Gamma)).$$

We will denote this by $V(\Gamma)$. We order these vectors as follows:

$$(\Delta(\Gamma), \alpha_{\Delta-1}(\Gamma), \ldots, \alpha_2(\Gamma)) < (\Delta(\Gamma'), \alpha_{\Delta'-1}(\Gamma'), \ldots, \alpha_2(\Gamma'))$$

if, and only if,

$$\Delta(\Gamma) < \Delta(\Gamma') \quad \text{or} \quad \bigl( \Delta(\Gamma) = \Delta(\Gamma') \text{ and } \alpha(\Gamma) < \alpha(\Gamma') \bigr).$$

Here, $\alpha(\Gamma)$ denotes the vector
$(\alpha_{\Delta-1}(\Gamma), \ldots, \alpha_2(\Gamma))$, and the ordering is the
lexicographic one.
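
The diameter vector and its ordering can be sketched as follows. This illustration uses breadth-first search instead of matrix powers, relying on the fact that a zero in the $t$-th power of the incidence matrix corresponds to an ordered pair of vertices at distance greater than $t$. Python compares tuples lexicographically, which matches the ordering just defined. The function names are hypothetical.

```python
from collections import deque

def distances_from(adj, src):
    """Breadth-first distances from src to every reachable vertex."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def diameter_vector(adj):
    """(diameter, alpha_{diam-1}, ..., alpha_2) for a connected graph."""
    n = len(adj)
    all_d = [d for s in range(n) for d in distances_from(adj, s).values()]
    if len(all_d) < n * n:
        raise ValueError("graph is disconnected")
    diam = max(all_d)
    # alpha_t counts ordered pairs at distance > t, i.e. zeros in the t-th power.
    alphas = [sum(1 for d in all_d if d > t) for t in range(diam - 1, 1, -1)]
    return (diam, *alphas)

cycle6 = [[(i - 1) % 6, (i + 1) % 6] for i in range(6)]
cycle7 = [[(i - 1) % 7, (i + 1) % 7] for i in range(7)]
```

Both cycles have diameter 3, but the 6-cycle has fewer pairs at distance 3, so its vector is the smaller of the two.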

The algorithm now proceeds as follows. A 4-tuple $(v_1, v_2, w_1, w_2)$ is
$\Gamma$-admissible if $v_1v_2$ and $w_1w_2$ are edges of $\Gamma$, and $v_1w_1$ and
$v_2w_2$ are not. To a $\Gamma$-admissible 4-tuple, we may associate a perturbation of
$\Gamma$, as defined above. From a fixed initial graph $\Gamma$, $\Gamma$-admissible
4-tuples are generated, and the associated perturbations are applied. This continues for
a large number of steps, until the initial graph is presumed randomized. The 4-tuples
are generated using a random number generator. The fixed initial graph (in the trivalent
case) is an $n$-cycle with antipodal points connected. After this is done, the steps are
as follows:

a. Select a $\Gamma$-admissible 4-tuple at random.

b. Compute $V(\Gamma')$, where $\Gamma'$ is the graph obtained by applying the
perturbation associated to the 4-tuple constructed in (a). If $V(\Gamma') < V(\Gamma)$,
set $\Gamma = \Gamma'$. Repeat step (a).

The perturbations are selected at random, since it was found that a simple
ordering of perturbations tended to bias the algorithm toward particular graphs.
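
Steps (a) and (b) can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the report's C program: the initial randomization phase is omitted, a perturbation that disconnects the graph is given an infinite diameter so that it is never accepted, the new edges are taken to be $v_1w_1$ and $v_2w_2$, and all function names are hypothetical.

```python
import random
from collections import deque

def diameter_vector(adj):
    """(diameter, alpha_{diam-1}, ..., alpha_2); the diameter is infinite if disconnected."""
    n = len(adj)
    all_d = []
    for s in range(n):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        if len(dist) < n:
            return (float("inf"),)
        all_d.extend(dist.values())
    diam = max(all_d)
    return (diam, *(sum(d > t for d in all_d) for t in range(diam - 1, 1, -1)))

def random_admissible(adj, rng):
    """Pick (v1, v2, w1, w2) with v1v2, w1w2 edges and v1w1, v2w2 non-edges."""
    edges = [(u, w) for u in range(len(adj)) for w in adj[u]]  # both orientations
    while True:  # assumes at least one admissible 4-tuple exists
        (v1, v2), (w1, w2) = rng.choice(edges), rng.choice(edges)
        if len({v1, v2, w1, w2}) == 4 and w1 not in adj[v1] and w2 not in adj[v2]:
            return v1, v2, w1, w2

def perturb(adj, v1, v2, w1, w2):
    """Return a new graph with edges v1v2, w1w2 replaced by v1w1, v2w2."""
    new = [set(nb) for nb in adj]
    for a, b in ((v1, v2), (w1, w2)):
        new[a].discard(b)
        new[b].discard(a)
    for a, b in ((v1, w1), (v2, w2)):
        new[a].add(b)
        new[b].add(a)
    return new

def hill_climb(adj, steps, seed=0):
    """Accept a random perturbation only when it lowers the diameter vector."""
    rng = random.Random(seed)
    best = diameter_vector(adj)
    for _ in range(steps):
        cand = perturb(adj, *random_admissible(adj, rng))
        score = diameter_vector(cand)
        if score < best:
            adj, best = cand, score
    return adj, best

def cycle_with_antipodes(n):
    """Trivalent starting graph: an n-cycle (n even) with antipodal points joined."""
    return [{(i - 1) % n, (i + 1) % n, (i + n // 2) % n} for i in range(n)]
```

A call such as `hill_climb(cycle_with_antipodes(12), steps=200, seed=1)` returns the final graph together with its diameter vector; the degree of every vertex is preserved because each perturbation removes and adds one edge at each of the four vertices involved.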

We compare our algorithm to that devised in Reference [4]. Our perturbations
are precisely their X-changes, but the functional we optimize is much more sensitive.
Theirs consists only of $\Delta(\Gamma)$ and of $\alpha_{\Delta-1}(\Gamma)$.

By the use of our algorithm, it has been possible to improve substantially most of the
densest known graphs. We give our improved version of the table constructed in
Reference [5]. $d$ denotes the degree of the graph, $k$ the diameter. The $(d,k)$ entry
is the largest known graph with diameter $k$ and degree $d$. Our entry is listed above;
the parenthesized value below is the value from Reference [5]. One asterisk indicates
that the graph is provably optimal. Two asterisks indicate that it is obtained from
Reference [3] using the result cited in Section 4.3.1.

The results obtained from this algorithm are in some cases surprising. Some of
the qualitative properties we observed are:

a. Many graphs obtained by random generation of graphs improved values in the
older version of the table in Reference [5] given by Storwick [6]. This suggests that
one is further from optima than was previously thought.

b. By evaluating the eigenvalues of the incidence matrices arrived at by the
algorithm, it was found that there are many distinct “local minima” (i.e., graphs for
which no perturbation improves the diameter vector) for the diameter vector. This
contradicts the suggestion made in Reference [4], that one tends to arrive at a
global optimum from all starting points. It seems that the algorithm in Reference [4]
suffers from two deficiencies. First, their objective functional for minimization is not
sufficiently sensitive, as we observed above. Second, their perturbations are done in
fixed sequential order, which severely skews their results. We have overcome this
difficulty by randomly selecting the perturbations at each stage.

c. Although our algorithm is efficient, it seems that substantially larger networks
could be studied if our diameter routine were modified to use the so-called
“Dijkstra algorithm,” which would speed up the diameter calculation substantially.

d. Although the algorithm is an improvement over all previous heuristic algorithms
for this problem, it is unable to find many known dense graphs arising from
systematic constructions. The reason for this seems to be the “denseness” of the set of
local optima in the set of all graphs of degree $d$, and the relative sparseness of the
so-called vertex transitive graphs therein. It seems, therefore, that it would be
desirable to design an algorithm which operates entirely inside a collection of vertex
transitive graphs, possibly with the Cayley graphs (see Section 4.3.4). Using the
Dijkstra diameter algorithm and an efficient description of many groups, such an
algorithm should be constructible. Moreover, it would allow much larger networks to
be studied, since the diameter calculation for vertex transitive graphs is
substantially shorter than that for arbitrary graphs.

Figure 4.1. Densest known regular graphs, June 1983.


4.3.3  Modification  of  the  Algorithm  for  More  Sensitive  Measures. 

In view of the remarks in Section 4.3.1 which identify the measures
$\Delta_k(\Gamma, l)$ as the diameter of an associated graph $\Gamma_k^l$, one can study
these measures in principle using the algorithm discussed in Section 4.3.2. However, the
graphs $\Gamma_k^l$ are usually too large for this procedure to be practicable. Our
current implementation of the algorithm will accept only graphs with fewer than 1000
points, and $\Gamma_k^l$ usually is larger than this. For the measures
$\Delta_k(\Gamma, l)$ we, therefore, propose the use of a “dual algorithm” based on
girth, which is much simpler to compute. (See Section 4.3.1.)

The modified girth algorithm is identical to the previous algorithm except that the
objective functional is altered. For a graph $\Gamma$, we define its girth vector to be

$$\gamma(\Gamma) = (\nu_1(\Gamma), \nu_2(\Gamma), \ldots, \nu_k(\Gamma), \ldots).$$

This is ordered by the lexicographic ordering, and the algorithm proceeds just as before,
except that we now accept a perturbation if it increases $\gamma(\Gamma)$. Applying this
algorithm to $\Gamma_k^l$ should produce heuristic results which improve these measures.

4.3.4 Vertex-Transitive Graphs.

Two desirable features of a graph to be used as a processor array are that its
description be as simple and as compact as possible. The graphs produced by a
heuristic algorithm generally will not be satisfactory from this point of view. For this
purpose it would be useful to restrict oneself to a class of graphs having a compact
description.

One such family is the collection of so-called Cayley graphs. (See Chapter 2 for
definitions and notation.) As an example, if $G = Z_n$, the cyclic group with $n$
elements, and $\Omega = \{T, T^{-1}\}$, where $T$ is a generator of $G$, the associated
graph is the cyclic graph of size $n$. A useful property of $\Gamma(G, \Omega)$ is that
its automorphism group is transitive on the vertices. This is clearly the case since the
right $G$-action on $G$ provides an action of $G$ on the graph, which is clearly
transitive on the vertex set. This is a useful property, since it means that the diameter
may be computed by finding the points of maximal distance from one given point. Also,
routing algorithms for these networks are compactly described, since one must only find
optimal paths starting at one given point.

One important proposed architecture, the cube-connected cycles of Reference [1], is
of this form. In fact, let $G$ be the semi-direct product $Z/n \times_\rho (Z/2)^n$,
where, if $T$ is a generator for $Z/n$,
$\rho(T)(x_1, \ldots, x_n) = (x_n, x_1, \ldots, x_{n-1})$. Thus, $G$ has elements
$(m, v)$, where $m \in Z/n$, $v \in (Z/2)^n$, and
$(m, v)(m', v') = (m + m', \rho(m')v + v')$. It is an easy calculation to see that if
$\Omega = \{(1,0), (-1,0), (0,e_1)\}$, where $e_1 = (1, 0, 0, \ldots, 0)$, then
$\Gamma(G, \Omega)$ is, in fact, isomorphic to the graph associated with the
cube-connected cycles. The diameter of the cube-connected cycles is known to grow as
$\frac{5}{2}\log_2(K)$, where $K$ is the number of vertices in the graph.

We should remark here that large girth (and hence small diameter) in Cayley
graphs is associated with non-commutativity of the group in question. This being the
case, the simple groups seem to be natural candidates to produce efficient graphs. This
is borne out by studying the diameters of $\Gamma(G, \Omega)$ for certain choices of
$\Omega$ and $G = S_n$, and observing that for large $n$, they approximate the Moore
bound. The order of the symmetric group $S_n$ is $n!$. Let $\Omega_k \subset S_n$ be the
set of all cycles of length $\leq k$. We propose to compare the diameter of
$\Gamma(S_n, \Omega_k)$ with its associated Moore bound. First, we observe that

$$|\Omega_k| = 1!\binom{n}{2} + 2!\binom{n}{3} + \cdots + (k-1)!\binom{n}{k}$$

by a simple counting argument. Thus, the degree of $\Gamma(S_n, \Omega_k)$ is

$$d = \sum_{j=2}^{k} (j-1)! \binom{n}{j}.$$

Thus, for large $n$, the Moore bound for $\Gamma(S_n, \Omega_k)$ is

$$\log_{d-1}(n!) = \frac{\ln(n!)}{\ln(d-1)}.$$

By Stirling’s formula,

$$\lim_{n \to \infty} \frac{\ln(n!)}{n \ln n} = 1,$$

so for large $n$, the Moore bound is approximated by

$$\frac{n \ln n}{\ln(d-1)}.$$

But, again for large $n$,

$$\ln(d-1) = \ln\Bigl( -1 + \sum_{j=2}^{k} (j-1)! \binom{n}{j} \Bigr)$$

is approximated by $\ln\binom{n}{k} \approx k \ln n$. Hence, for large $n$, the Moore
bound for $\Gamma(S_n, \Omega_k)$ is approximated by $n/k$. The diameter of
$\Gamma(S_n, \Omega_k)$, on the other hand, tends to $n/(k-1)$, as one readily computes
in $S_n$. Consequently, the diameter of $\Gamma(S_n, \Omega_k)$ is within a factor of

$$\frac{k}{k-1}$$

of the Moore bound. As $k$ becomes larger, we are able to approximate equality
arbitrarily closely. So it seems that $S_n$ is a plausible candidate for further study.
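
For very small $n$, the degree count and the diameter of $\Gamma(S_n, \Omega_k)$ can be checked by brute force. The following sketch is illustrative only; it represents a permutation as a tuple of images, uses vertex transitivity to search from the identity alone, and all function names are hypothetical.

```python
from collections import deque
from itertools import combinations, permutations
from math import comb, factorial

def cycles_up_to(n, k):
    """All cycles of length 2..k in S_n, each as a tuple p with p[i] the image of i."""
    gens = []
    for j in range(2, k + 1):
        for support in combinations(range(n), j):
            first, rest = support[0], support[1:]
            for order in permutations(rest):   # (j-1)! distinct cycles per support
                cyc = (first, *order)
                p = list(range(n))
                for a, b in zip(cyc, cyc[1:] + cyc[:1]):
                    p[a] = b
                gens.append(tuple(p))
    return gens

def cayley_diameter(n, k):
    """Largest distance from the identity, which is the diameter by vertex transitivity."""
    gens = cycles_up_to(n, k)
    identity = tuple(range(n))
    dist = {identity: 0}
    queue = deque([identity])
    while queue:
        g = queue.popleft()
        for w in gens:
            h = tuple(g[w[i]] for i in range(n))   # the product g*w (apply w first)
            if h not in dist:
                dist[h] = dist[g] + 1
                queue.append(h)
    assert len(dist) == factorial(n)   # the cycles generate all of S_n
    return max(dist.values())

def degree(n, k):
    """The counting formula: the sum over j = 2..k of (j-1)! * C(n, j)."""
    return sum(factorial(j - 1) * comb(n, j) for j in range(2, k + 1))
```

With transpositions alone ($k = 2$) the diameter of the graph on $S_4$ is 3; allowing 3-cycles as well brings it down to 2, at the cost of raising the degree from 6 to 14.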

We now show how to modify the cube-connected cycles, using group theoretic
methods, to provide an infinite family of vertex transitive trivalent graphs whose
diameter is substantially smaller, but which has all the desirable regularity properties
of the cube-connected cycles.

Let $G_n = Z/n \times_\rho (Z/2)^n$, as before. Note that $G_n$ contains a central
element, namely the vector $(0, (1, 1, \ldots, 1))$. Following our intuition concerning
the relationship between non-commutativity and small diameter, we eliminate the central
element by simply factoring it out. Call the quotient group $\bar{G}_n$ and let
$\bar{\Omega}$ denote the image of $\Omega$. Then we claim that the diameter of
$\Gamma(\bar{G}_n, \bar{\Omega})$ grows as $2\log_2(K)$, where $K$ is the number of
vertices, an improvement over the diameter obtained for the cube-connected cycles.

Proposition:  The  diameter  of  r(G>,ft)  grows  as  2log ^k). 

Proof. By taking inverses, we can clearly consider the graph

$\Gamma(\bar{G}_n, \bar{\Omega})$,

where $(g, g\omega)$ is an edge for $\omega \in \bar{\Omega}$. An element of $\bar{G}_n$ is given by an ordered pair $(m, v)$, where $m \in \mathbb{Z}/n$, $v \in \bar{V} = (\mathbb{Z}/2)^n/(1, 1, \ldots, 1)$.

The multiplication is given by

$(m, v)e_1 = (m, v + e_1)$, $(m, v)T = (m+1, \rho(T)v)$, and $(m, v)T^{-1} = (m-1, \rho(T)^{-1}v)$.

An algorithm for expressing $(m, v)$ in terms of the generators $T$, $T^{-1}$, and $e_1$ is described as follows.

Let $x_1(w)$ denote the first coordinate of a vector $w \in V = (\mathbb{Z}/2)^n$. For any $v \in \bar{V}$, one may lift $v$ to an element $\tilde{v}$ of $V$, so that the number of non-zero coordinates in $\tilde{v}$ is $\le n/2$, for if one lifted $\tilde{v}$ does not have this property, then $\tilde{v} + (1, 1, \ldots, 1)$ does. Given $(m, v)$, select $\tilde{v}$ as above.

The algorithm now proceeds as follows:

a.  Initialize a counter a at $n-1$.

b.  Is $x_1(\tilde{v}) = 0$? If yes, proceed to (d); if no, proceed to (c).

c.  Multiply by $e_1$. Proceed to (d).

d.  Multiply by $T$, and decrement a by 1. Proceed to (e).

e.  Is a = 0? If so, proceed to (f). If not, return to (b).

f.  If $m \ne 0$, multiply by $T^q$, where $q$ is the integer of minimal absolute value congruent to $-m$ mod $n$. Note that $|q| \le n/2$. Quit.

Since the lifted vector has at most $n/2$ ones, we multiply by $e_1$ at most $n/2$ times. Thus, the total number of steps is at most $(n-1) + n/2 + n/2 = 2n-1$. For odd $n$, it is at most $2n-3$, which is of the same order as $2\log_2(n 2^{n-1})$.
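The comparison between the step bound and the vertex count can be checked with integer arithmetic. The following is a minimal sketch, assuming the quotient graph has $k = n 2^{n-1}$ vertices (consistent with the 80-node case for $n = 5$ used in the layout discussion). The function names are hypothetical and are not taken from the report. Since $\log_2 k = (n-1) + \log_2 n$, the bound $2n-1$ never exceeds $2\log_2 k + 1$.

```c
#include <assert.h>

/* Number of vertices k = n * 2^(n-1); restricted to n <= 32 so the
   result fits comfortably in an unsigned long long. */
unsigned long long mccc_vertex_count(unsigned n)
{
    assert(n >= 1 && n <= 32);
    return (unsigned long long)n << (n - 1);
}

/* Worst-case step count quoted in the text: (n-1) + n/2 + n/2 = 2n - 1. */
unsigned mccc_step_bound(unsigned n)
{
    assert(n >= 1);
    return 2 * n - 1;
}

/* floor(log2(k)) by repeated halving, which avoids floating point. */
unsigned floor_log2(unsigned long long k)
{
    unsigned bits = 0;
    assert(k >= 1);
    while (k > 1) {
        k >>= 1;
        ++bits;
    }
    return bits;
}
```

For $n = 5$ this gives 80 vertices and a step bound of 9, against $2\lfloor\log_2 80\rfloor = 12$.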

If one forms the quotient of this graph by the equivalence relation $(m, v) \sim (m', v)$ for all $m, m' \in \mathbb{Z}/n$, one obtains a non-regular family of graphs whose diameter grows as

— log2(n),  which  is  comparable  to  that  obtained  in  Reference  (7),  and  for  which  the 

routing algorithm is much simpler. Finally, an alternative version of this construction is given by forming the $(n-1)$-cube, inserting $n$-cycles at every vertex so that the incoming edges each connect at distinct vertices, and connecting the remaining vertex to the corresponding vertex for the antipodal point on the cube.
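The cube-based description above can be turned into an explicit neighbor function. The sketch below is one reading of that description, not the authors' code: a node is a pair (x, i), with x a vertex of the $(n-1)$-cube and i a position on its $n$-cycle. It assumes that cycle position i carries the cube edge in dimension i for i < n-1, and that position n-1 carries the antipodal edge; this assignment, and all function names, are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Node (x, i) is encoded as x * n + i. */
static unsigned node_id(unsigned x, unsigned i, unsigned n) { return x * n + i; }

/* Total number of nodes, n * 2^(n-1). */
unsigned mccc_node_count(unsigned n)
{
    assert(n >= 3 && n <= 16);
    return n << (n - 1);
}

/* Writes the three neighbors of node `id` into out[0..2]. */
void mccc_neighbors(unsigned id, unsigned n, unsigned out[3])
{
    unsigned x = id / n, i = id % n;
    unsigned all_ones = (1u << (n - 1)) - 1u;
    assert(n >= 3 && n <= 16 && id < mccc_node_count(n));
    out[0] = node_id(x, (i + 1) % n, n);        /* next node on the cycle */
    out[1] = node_id(x, (i + n - 1) % n, n);    /* previous node on the cycle */
    if (i < n - 1)
        out[2] = node_id(x ^ (1u << i), i, n);  /* cube edge in dimension i */
    else
        out[2] = node_id(x ^ all_ones, i, n);   /* edge to the antipodal cube vertex */
}

/* Breadth-first search from `start`. Returns the largest distance found,
   or -1 if some node is unreachable or memory allocation fails. */
int mccc_eccentricity(unsigned start, unsigned n)
{
    unsigned count = mccc_node_count(n), head = 0, tail = 0, reached = 0;
    int far = 0;
    int *dist = malloc(count * sizeof *dist);
    unsigned *queue = malloc(count * sizeof *queue);
    if (!dist || !queue) { free(dist); free(queue); return -1; }
    for (unsigned v = 0; v < count; ++v) dist[v] = -1;
    dist[start] = 0;
    queue[tail++] = start;
    while (head < tail) {
        unsigned v = queue[head++], nb[3];
        ++reached;
        if (dist[v] > far) far = dist[v];
        mccc_neighbors(v, n, nb);
        for (int j = 0; j < 3; ++j)
            if (dist[nb[j]] < 0) { dist[nb[j]] = dist[v] + 1; queue[tail++] = nb[j]; }
    }
    free(dist); free(queue);
    return reached == count ? far : -1;
}
```

If the graph is vertex transitive, as stated for the group-theoretic construction, the eccentricity of any single node equals the diameter, so `mccc_eccentricity(0, n)` can be compared directly against the $2n-1$ bound.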

4.3.5  A  Layout  for  the  Modified  Cube-Connected  Cycles. 

In Reference [1], two layouts are proposed for the cube-connected cycles, one slightly more efficient than the other. By combining these two layouts, we obtain a layout for the modified cube-connected cycles. The area of the layout grows as 3/2 times the area of the cube-connected cycles with the same number of nodes, and has communication time roughly 4/5 times that of the cube-connected cycles. We give the layout for the case $n = 5$, corresponding to $5 \cdot 2^4 = 80$ nodes. It is clear from the diagram (Figure 4.2) how to extend to the general case.


Figure  4.2.  A  layout  for  the  modified  cube-connected  cycles 
5.  Modular  Hardware  Description  Language


5.1  Introduction 

This chapter describes a modular hardware description language (MHDL) developed to provide an easy means of simulating the numerical, and other high-level, behavior of novel computer architectures, especially those proposed for real-time signal processing applications. The principal goals of MHDL are:

a.  Easy  specification  of  elementary  building  blocks  (machines)  at  an  algorithmic 
level. 

b.  Automatic  reproduction  of  any  number  of  already  designed  modules  and  easy 
specification  of  interconnection  schemes  for  these  modules  to  produce  new  modules. 

c.  Simulation of the behavior and performance of the resulting machines by the compiled MHDL code.

This language was developed on and for computer systems using UNIX* operating systems. While it could be modified to run on any system supporting the C programming language, only its use on UNIX systems is discussed here, and we make use of programs available on UNIX with little or no comment.

The remaining sections of this chapter give a brief description of the procedure for installing MHDL on a system (which may in practice be less than automatic due to differences in C compilers even among "standard" UNIX systems), and for compiling MHDL programs. We also describe the syntax and grammar of the language, and discuss possible future improvements of the language. Examples of MHDL programs and source listings for the compiler may be found in Reference [1].

5.2  MHDL 

5.2.1.  Components  of  the  Compiler 

The  compiler  consists  of  the  following  files: 

lexer.c

This is the source file for the lexical analyzer for MHDL. It breaks the input stream from a MHDL program into tokens, and stores all identifiers in a symbol table. Although lex was not used for this lexical analyzer, much of the structure of the lexical analyzer remains compatible with the lex environment.

* UNIX is a registered trademark of Bell Laboratories.

mhdlyacc 

This  is  the  source  file  for  the  MHDL  parser  and  code  generator.  It  receives  the  input 
stream  and  tokens  from  the  lexical  analyzer,  checks  for  any  syntax  errors,  and  generates 
appropriate  code  in  the  language  C.  Since  the  syntax  for  MHDL  is  very  straightforward, 
only  a  modest  number  of  error  messages  have  been  included  in  the  parser. 

y.tab.c

y.tab.h 

These  are  files  produced  when  yacc  is  run  on  mhdlyacc. 

declar.h 

This  contains  all  the  global  declarations  for  the  combined  program  of  lexer.c  with 
mhdlyacc. 

Makefile 

This is the makefile for mhdl. The command make mhdl will create the file mhdl, which contains the object file for the compiler. Note that the files Makefile, declar.h, mhdlyacc, and lexer.c must all be present to be able to "make" mhdl.

mhdl 

This  is  the  object  file  for  the  MHDL  compiler.  It  is  produced  by  Makefile  using  the 
command  make  mhdl. 

xxmhdl.c 

xxmhdl.global 

xxmhdl.proced 

xxmhdl.declar 

yymhdl.c 

These files are all produced when mhdl compiles a MHDL program. The four files of the form xxmhdl.* are always produced by mhdl. The file yymhdl.c, which is simply a readable version of the xxmhdl files, is only produced if the mhdl compiler detects no errors. Note that if the C compiler finds errors in the program it is much easier, if not completely necessary, to work with yymhdl.c rather than the xxmhdl.* version.

5.2.2.  Using  the  Compiler 

A  file  called  mhdl  is  needed  to  compile  a  MHDL  program.  If  this  file  does  not  exist 
on  your  system,  then  see  the  discussion  in  Section  2a  to  obtain  a  copy  of  this  file.  The 
steps  for  using  MHDL  are  the  following: 




a.  Enter mhdl MHDLprogram (a MHDL program may consist of several files).

b.  Correct  all  errors  reported  by  the  MHDL  compiler  and  repeat  step  a. 

c.  Enter  cc  yymhdl.c. 

d.  Correct  all  errors  reported  by  the  C  compiler  using  the  file  yymhdl.c  for  refer¬ 
ence,  and  repeat  steps  a,  b,  and  c. 

e.  Enter a.out or a.out < datafile depending on whether you wish to use standard input or an already prepared input file. (The string "datafile" should not be interpreted as a literal.)


5.3  Description  of  the  Language 


5.3.1  Syntax  and  Grammar

A MHDL program is made up of a sequence of blocks. Each block may be any one of the following types:

Primitive  Module 

Module 

Global 

Procedure 

Connection  Scheme 

Configuration 

The syntax for each of these blocks is illustrated below:


Primitive Module <module name>
    Parameter <var. decls.>
    Input <var. decls.>
    Output <var. decls.>
    Inout <var. decls.>
    State <var. decls.>
    Uses
        Procedure <procedure names>
    End Uses
    Behavior
        <C code>
End Primitive Module <module name>


Module <module name>
    Parameter <var. decls.>
    Input <var. decls.>
    Output <var. decls.>
    Inout <var. decls.>
    Uses
        Primitive Module <prim. mod. name [no. of times used]>
        Module <mod. name [no. of times used]>
        Connection Scheme <connection scheme name>
        Configuration <configuration name>
        Procedure <procedure name>
    End Uses
    Behavior
        <C code>
End Module <module name>


Global
    <C code>
End Global


Procedure <procedure name>
    <C procedure>
End Procedure <procedure name>


Connection Scheme <scheme name>
End Connection Scheme <scheme name>


Configuration <configuration name>
End Configuration <configuration name>


The following is a slightly informal Backus-Naur form for the grammar for MHDL. Literals are in boldface. Alternatives are separated by a vertical bar '|'. A group that may be repeated a certain number of times is enclosed in braces, '{' and '}', with the number of repetitions indicated by '+' to indicate 1 or more repetitions and a '*' to indicate 0 or more repetitions. Optional terms are enclosed in '[' and ']', and any terms starting with C- are meant to refer to the corresponding objects in the language C.


program ->  { block }+


block ->
    { prim-module-block | module-block | global-block | procedure-block |
      connect-scheme-block | config-block | whitespace }


prim-module-block ->
    Primitive whitespace Module whitespace block-name whitespace
    var-declarations
    State C-code
    [ Uses { Procedure identifier }* End whitespace Uses ]
    Behavior C-code
    End whitespace Primitive whitespace Module
    whitespace block-name


module-block ->
    Module whitespace block-name whitespace
    var-declarations
    Uses
    { { { Primitive whitespace Module | Module }
        { whitespace identifier [ C-code ] }+ } |
      Procedure whitespace {identifier}+ |
      Configuration whitespace {identifier}+ |
      Connection whitespace Scheme whitespace {identifier}+ }+
    End whitespace Uses
    Behavior C-code
    End whitespace Module whitespace block-name


global-block ->  Global C-code End whitespace Global


procedure-block ->
    Procedure whitespace block-name C-code
    End whitespace Procedure whitespace block-name


connect-scheme-block ->
    Connection whitespace Scheme
    whitespace block-name C-code
    End whitespace Connection whitespace block-name C-code


config-block ->
    Configuration whitespace block-name C-code
    End whitespace Configuration whitespace block-name


block-name ->  C-identifier


var-declarations ->
    [ Parameter C-variable-declarations ]
    [ Input C-variable-declarations ]
    [ Output C-variable-declarations ]
    [ Inout C-variable-declarations ]


whitespace ->  { C-whitespace | MHDL-comment }+

MHDL-comment ->  $ {any character except NEWLINE or FORM-FEED}*
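The MHDL-comment rule can be illustrated with a small filter. This is a minimal sketch, not the lexer from the MHDL sources: the function name is hypothetical, and it assumes every '$' starts a comment, so it does not account for a '$' that might appear inside a C string or character literal.

```c
#include <assert.h>
#include <stddef.h>

/* Copies src to dst, dropping every run that starts at '$' and extends up to
   (but not including) the next newline or form feed. dst must have room for
   as many characters as src plus the terminating NUL. Returns the number of
   characters written, not counting the NUL. */
size_t mhdl_strip_comments(const char *src, char *dst)
{
    size_t written = 0;
    int in_comment = 0;
    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; ++src) {
        if (in_comment) {
            if (*src != '\n' && *src != '\f')
                continue;          /* still inside the comment: drop it */
            in_comment = 0;        /* the terminator ends the comment and is kept */
        } else if (*src == '$') {
            in_comment = 1;        /* comment starts: drop the '$' itself */
            continue;
        }
        dst[written++] = *src;
    }
    dst[written] = '\0';
    return written;
}
```

Keeping the terminating newline or form feed means line numbers in later error messages still match the original file.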


5.3.2  MHDL  Semantics  -  How  MHDL  'runs'  a  Module

The Module and Primitive Module blocks are the only ones in this version of MHDL that have nontrivial behavior. Configuration and Connection Scheme blocks are unsupported in this version and cause an error message. The C-code in Global and Procedure blocks is copied directly to sections of the produced code external to all other procedures. The code in a Global block is guaranteed to appear before all other code.

The two kinds of modules, primitive and non-primitive, are set up in very different ways. For primitive modules all of the input and output variables, together with the state variables, are put together as one structure declaration. The number of times this primitive module is used in the machine being described is counted and an array of global variables is declared with the type of the structure just created. As the global machine runs, bookkeeping is done to keep track of the index of the current primitive module that is actually running. Only the I/O and state variables for that particular copy of the primitive module are affected. Running a primitive module amounts to executing the code in the Behavior section exactly as it appears, except that all I/O and state variables are preceded with an array structure pointer.
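The following sketch suggests what the generated C for a primitive module might look like under the scheme just described. It is an illustration only: the module "Accum", its variables, and every identifier below are hypothetical, and the actual names and layout emitted by the mhdl compiler are not given in this chapter.

```c
#include <assert.h>

/* Hypothetical primitive module "Accum" with Input in, Output out, State sum.
   All of its variables are gathered into one structure declaration. */
struct accum_vars {
    int in;    /* Input  */
    int out;   /* Output */
    int sum;   /* State  */
};

#define ACCUM_COPIES 3   /* number of times the module is used in the machine */

/* One record per copy of the module; static storage starts out as zero. */
static struct accum_vars accum[ACCUM_COPIES];

/* Bookkeeping: index of the copy that is currently running. */
static int accum_index = 0;

/* The Behavior code "sum += in; out = sum;" with each variable prefixed by
   an access into the array for the current copy. */
void accum_run(void)
{
    assert(accum_index >= 0 && accum_index < ACCUM_COPIES);
    accum[accum_index].sum += accum[accum_index].in;
    accum[accum_index].out = accum[accum_index].sum;
}

/* Convenience driver for the sketch: selects a copy, sets its input,
   runs it once, and returns its output. */
int accum_step(int copy, int in)
{
    accum_index = copy;
    accum[accum_index].in = in;
    accum_run();
    return accum[accum_index].out;
}
```

Running copy 0 twice and copy 1 once leaves each copy with its own accumulated state, which is the effect of the per-copy indexing.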

For non-primitive modules all of the input and output variables are used to create a global variable in the same manner as was done for primitive modules. Bookkeeping for the current running copy of the module is also done in the same way as for primitive modules. Running a non-primitive module should be thought of as occurring in two steps. In the first step, the Behavior section of the module is executed as it appears, and connects this module to the various inputs of its submodules. For the second step, the Uses declaration is used to count the number of times a submodule is used to make up this module, and the submodule is simply run that many times, incrementing the index of the submodule for each run. There is a distinguished module with name "Main" that is always the module representing the global machine. The program created by MHDL has as its only task the running of Main until something in the MHDL program causes the program to terminate (usually caused by executing an exit() in one of the primitive modules).
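The two-step execution of a non-primitive module, and the loop that repeatedly runs Main, can be sketched as follows. Everything here is hypothetical: the submodule "Cell", the wiring inside the Behavior step, and all identifiers are invented for illustration. The sketch also uses a stop flag so that the driver can return, whereas the text says real MHDL programs usually terminate through exit().

```c
#include <assert.h>

/* Hypothetical primitive submodule "Cell", used twice by Main. */
#define CELL_COPIES 2
struct cell_vars { int in; int out; int count; };
static struct cell_vars cell[CELL_COPIES];
static int cell_index = 0;

/* I/O record for the distinguished module Main (a single copy). */
struct main_vars { int in; int out; };
static struct main_vars main_module;

/* Set by a Behavior section to stop the simulation (stands in for exit()). */
static int stop_requested = 0;

/* Behavior of Cell for the copy selected by cell_index. */
static void cell_run(void)
{
    struct cell_vars *c;
    assert(cell_index >= 0 && cell_index < CELL_COPIES);
    c = &cell[cell_index];
    c->count += 1;
    c->out = c->in + c->count;
    if (c->count >= 3)
        stop_requested = 1;
}

/* One run of Main. */
static void main_run(void)
{
    /* Step 1: Behavior section of Main, which feeds the submodule inputs. */
    cell[0].in = main_module.in;
    cell[1].in = cell[0].out;
    main_module.out = cell[1].out;

    /* Step 2: run Cell once per copy named in Uses, advancing the index. */
    for (cell_index = 0; cell_index < CELL_COPIES; ++cell_index)
        cell_run();
}

/* Driver: keep running Main until a module asks to stop.
   Returns the number of times Main was run. */
int simulate(int input)
{
    int runs = 0;
    main_module.in = input;
    while (!stop_requested) {
        main_run();
        ++runs;
    }
    return runs;
}

/* Reads Main's output after the simulation has stopped. */
int main_output(void) { return main_module.out; }
```

Because the Behavior step runs before the submodules, each value takes one run of Main to move one stage along the chain, which mimics a clocked pipeline.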


5.4  Possible  Improvements 

The present version of MHDL was designed as a prototype and as such many possible extensions or changes were not incorporated until more experience had been gained with the language. The following list of changes contains features that the designer would most like to see improved. Many of these features may not appeal to general users, and many may no longer be appropriate if the intended use of this language should shift.

a.  Allow  levels  of  nesting  of  modules. 

The current version allows only one level of nesting, and this is clearly too restrictive for general use on larger problems.

b.  Allow  Global  Declarations  within  Modules 

The  language  should  promote  module  structure  by  not  forcing  the  user  to  put 
global  declarations  in  a  separate  block. 

c.  Increase  debugging  facilities 

(1)  Test  Connection  Structure  for  0  or  Multiple  Connections. 

(2)  Check  that  a  module's  I/O  variables  are  only  used  in  an  appropriate  way 
(e.g.,  that  Input  variables  are  only  used  on  the  right  sides  of  equations). 

(3)  Allow easy (or automatic) printout of the values for I/O and state variables to aid user debugging.

d.  Allow  Identifiers  to  use  any  Number  of  Significant  Letters. 

e.  Develop  Connection  Scheme  Concept  Beyond  Linear  Numbering.

f.  Allow  Multiple  and  Conditional  Calls  of  Modules. 

The multiple calls would be useful, for example, when a module operating at the word level uses a module operating at the bit level. Conditional calls could be useful if certain electronic characteristics were to be simulated at the MHDL level (rather than hidden in the user's code).

g.  Develop  Parameter  Concept  for  Modules. 

h.  Improve  Initialization  Facilities 

Several  types  of  initialization  are  now  awkward  within  MHDL  and  could  be 
greatly  improved.  Currently  all  I/O  and  state  variables  are  initialized  to  the  value 
0,  and  there  is  no  mechanism  for  other  initialization  values.  Along  the  same  lines, 
files  need  to  be  opened  for  use  and  currently  only  standard  input  and  output  are 
easily  accessed. 